Abstract
Background
Accurate structure prediction methods play an important role for the understanding of RNA function. Energybased, pseudoknotfree secondary structure prediction is one of the most widely used and versatile approaches, and improved methods for this task have received much attention over the past five years. Despite the impressive progress that as been achieved in this area, existing evaluations of the prediction accuracy achieved by various algorithms do not provide a comprehensive, statistically sound assessment. Furthermore, while there is increasing evidence that no prediction algorithm consistently outperforms all others, no work has been done to exploit the complementary strengths of multiple approaches.
Results
In this work, we present two contributions to the area of RNA secondary structure prediction. Firstly, we use stateoftheart, resamplingbased statistical methods together with a previously published and increasingly widely used dataset of highquality RNA structures to conduct a comprehensive evaluation of existing RNA secondary structure prediction procedures. The results from this evaluation clarify the performance relationship between ten wellknown existing energybased pseudoknotfree RNA secondary structure prediction methods and clearly demonstrate the progress that has been achieved in recent years. Secondly, we introduce AveRNA, a generic and powerful method for combining a set of existing secondary structure prediction procedures into an ensemblebased method that achieves significantly higher prediction accuracies than obtained from any of its component procedures.
Conclusions
Our new, ensemblebased method, AveRNA, improves the state of the art for energybased, pseudoknotfree RNA secondary structure prediction by exploiting the complementary strengths of multiple existing prediction procedures, as demonstrated using a stateoftheart statistical resampling approach. In addition, AveRNA allows an intuitive and effective control of the tradeoff between false negative and false positive base pair predictions. Finally, AveRNA can make use of arbitrary sets of secondary structure prediction procedures and can therefore be used to leverage improvements in prediction accuracy offered by algorithms and energy models developed in the future. Our data, MATLAB software and a webbased version of AveRNA are publicly available at http://www.cs.ubc.ca/labs/beta/Software/AveRNA.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Background
RNAs are amongst the most versatile and oldest biomolecules; they play crucial roles in many biological processes. As in the case of proteins, the function of many types of RNAs critically depends on the threedimensional structure of the molecules. However, the 3D structure of RNAs is determined to a larger degree by their secondary structure, which arises from basepairing interactions within an RNA strand and stacking of the resulting base pairs.
Since the direct determination of 3D structures is difficult and costly, computational structure prediction methods, and in particular, secondary structure prediction methods, are widely used. A prominent and versatile approach for predicting RNA secondary structures is based on thermodynamic models, such as the Turner model [1], and uses dynamic programming algorithms (such as the Zuker & Stiegler algorithm [2]), to find a structure with minimum free energy (MFE) for a specific RNA sequence. Over the last five years, considerable improvements in the predictions obtained by such algorithms have been achieved.
It is important to note that, while it might seem natural to use experiments to determine the parameters of a thermodynamic model and machine learning and optimisation to determine those of a stochastic model, because of the equivalence between the free energy and probability of RNA structures, in principle, both approaches can be used in either setting. Indeed, the largest improvement in prediction accuracy has resulted from the use of sophisticated methods for estimating the thermodynamic parameters of a given energy model (in particular, the Turner model), based on a set of reliable RNA secondary structures [3–5]. Particularly good results have been achieved for methods in which parameter estimation additionally takes into account thermodynamic data from optical melting experiments, such as CG, LAMCG and BL, [4, 5] and expand the standard energy model with probabilistic relationships between structural features (e.g., hairpin loops of different lengths), such as BLFR [5]. Improved prediction accuracy has also been reported for an approach that determines structures with maximum expected accuracy (MEA) rather than minimum free energy, based on base pairing probabilities obtained from a partitionfunction calculation [3, 6, 7]. CONTRAfold implements a conditional loglinear model (which generalizes upon stochastic contextfree grammars) for structure prediction. MaxExpect starts from basepair probabilities calculated by partition functions [8] and uses dynamic programming (similar to CONTRAfold) to predict the MEA structure [6]. And finally, CentroidFold uses a similar strategy except that it uses a weighted some of true positives and true negatives as the objective function [7].
While the overall improvement in accuracy achieved over the baseline provided by the Zuker & Stiegler algorithm using the Turner model is clearly significant, there is less certainty about the more modest performance relationships between some of the more recent methods. For example, Lu et al. reported a difference of only 0.3% in average sensitivity between their MaxExpect procedure and CONTRAfold 2.0[3]. Similarly, Andronescu et al. found a 0.5% difference in average Fmeasure between CONTRAfold 2.0 and their CG ^{∗} procedure [5]. Whether such small performance differences can be considered significant is an open question; in fact, a crossvalidation experiment for the BL and LAMCG parameter estimation methods suggests that 3% differences in accuracy may be statistically significant, but the evidence is far from conclusive [5]. This suggests that there is a need for methods that make it possible to assess the statistical significance of differences in prediction accuracy observed between RNA secondary structure prediction methods. In this work we present such methods, based on two wellestablished resampling techniques from statistics, bootstrapped confidence intervals and permutation tests. Using these methods and a wellstudied, large set of trusted RNA secondary structures, we assess progress and the state of the art in energybased, pseudoknotfree RNA secondary structure prediction.
Also, it has been demonstrated that the accuracies of predictions based on their BL ^{∗}, CG ^{∗} and Turner99 parameter sets (see their Supplementary Results C) are not consistent across large and diverse sets of RNAs, and that differences in accuracy for many individual RNAs often deviate markedly from the average accuracy values measured across the entire set [5]. This suggests that by combining the predictions obtained from different procedures, better results can be achieved than by using any one of the given procedures in isolation. This general idea has been previously applied to a wide range of problems in computing science (where it underlies the fundamental approaches of boosting and bagging [9]). More recently, it has been used successfully for solving various problems from computational biology, including gene prediction [10], clustering proteinprotein interaction networks [11], as well as analysis of data from microarrays [12] and flow cytometry [13].
Here, we introduce a generic RNA secondary structure prediction procedure that, given an RNA sequence, uses an ensemble of existing prediction procedures to obtain a set of structure predictions, which are then combined on a perbasepairbasis to produce a combined prediction. Empirical analysis demonstrate that this ensemblebased prediction procedure, which we dub AveRNA, outperforms the previous stateoftheart secondary structure prediction procedures on a broad range of RNAs. On the SSTRAND2 dataset [14], AveRNA obtained an average Fmeasure of 71.6%, compared to the previous best value of 70.3% achieved by BLFR ^{∗}[5]. AveRNA can easily be extended with new prediction procedures; furthermore, it provides an intuitive way of controlling the tradeoff between false positive and false negative predictions. This is useful in situations where high sensitivity or high PPV may be required and allows AveRNA to achieve a sensitivity of over 75% and a PPV of over 83% on SSTRAND2.
Methods
In this section, we first describe the data set and prediction accuracy measures used in our work. Next, we introduce the statistical methodology for the empirical assessment of RNA secondary structure prediction algorithms we developed in this work. This is followed by a brief summary of the set of procedures for MFEbased pseudoknotfree RNA secondary structure prediction we used in this work. Finally, we present AveRNA, our novel RNA secondary structure prediction approach, which combines predictions obtained from a diverse given set of procedures by means of weighted perbasepair voting.
Data sets
In this work, we used the SSTRAND2 dataset [14], which consists of 2518 pseudoknotfree secondary structures from a wide range of RNA classes, including ribosomal RNAs, transfer RNAs, transfer messenger RNAs, ribonuclease P RNAs, SRP RNAs, hammerhead ribozymes and group 1/2 introns [15–20]. This large and diverse set is comprised of highly accurate structures and has been used for the evaluation of secondary structure prediction accuracy in the literature [5]. For the parts of our work involving the optimization of prediction accuracy, in order to avoid overfitting, we used a subset of the SSTRAND2 dataset obtained by sampling 500 structures uniformly at random as the basis for the optimization process, and the full SSTRAND2 dataset for assessing the resulting, optimized prediction procedures.
Existing secondary structure prediction methods
We used 10 secondary prediction procedures known from the literature. The SimFold V2.0 procedure [21], which is based on Zuker and Stiegler’s classic dynamic programming algorithm, was used to predict secondary structures using six different sets of free energy parameters: Turner99 [1]; NOMCG [4]; DIMCG [22]; CG ^{∗}, BL ^{∗} and BLFR ^{∗}[5]. Furthermore, we used CONTRAfold v1.1, CONTRAfold v2.0 [3], MaxExpect v5.1 [6] and CentroidFold v0.0.9 [7]. The two versions of CONTRAfold as well as CentroidFold are based on probabilistic methods that do not make use of physically plausible thermodynamic models of RNA secondary structure, while the seven other procedures are all based on (variations of) the widely used free energy model by the Turner group [1].
While we originally also considered taveRNA [23] and SARNAPredict [24], it turned out to be infeasible to run these procedures on the the longer sequences from the SSTRAND2 dataset (due to runtime and memory requirements).
Accuracy measures
Consistent with existing work on RNA secondary structure prediction, we assessed the prediction accuracy achieved by a given RNA secondary structure prediction procedure based on a given set of references structures, using sensitivity, positive predictive value (PPV) and the Fmeasure. We define a correctly predicted basepair to be a predicted basepair, exactly identical to one of the basepairs in the reference structure. For a single RNA (sequence, structure) pair, sensitivity is the ratio of number of correctly predicted basepairs to the number of basepairs in the reference structure:
PPV is the ratio of number of correctly predicted basepairs to the number of basepairs in the predicted structure:
and the Fmeasure is defined as the harmonic mean of sensitivity and PPV:
If there are no basepairs in the predicted structure and the reference structure, we define PPV and Sensitivity to be 1 and otherwise 0. The Fmeasure, sensitivity, and PPV for the prediction of any individual structure are always in the interval [ 0,1], where 1 characterizes a perfect prediction. When assessing the prediction accuracy on a given set of structures, we usually report the average Fmeasure, sensitivity, and PPV achieved over the entire set.
Statistical analysis of prediction accuracy
To formally assess the degree to which prediction accuracy results measured for a given set of RNAs depend on the precise choice of this set, we employ two wellknown statistical resampling techniques: bootstrap confidence intervals and permutation tests (see, e.g., [25]). Details on the respective procedures developed and used in the context of this work are described in the following. Here, we applied these statistical analysis procedures to the average Fmeasure determined for predictions on a given set of RNAs, but we note that they generalize in a straightforward manner to other measures of accuracy and other statistics of these over the given benchmark set. We note that these statistical techniques are limited to assessing the impact of different samples from the same underlying distribution – an important issue, considering the large variation of prediction accuracy within the sets of RNAs commonly used for evaluating structure prediction procedures – but do not allow assessment of prediction accuracy might vary as the underlying distribution is changed (e.g., by modifying the relative representation of RNAs from different families or of different provenance); to address this latter question, we use a different approach described later.
To investigate the consistency of predictions obtained from different RNA secondary structure prediction procedures, we used scatter plots as well as the Spearman correlation coefficient (which, unlike the more widely used Pearson product moment correlation coefficient, also captures nonlinear relationships).
Bootstrap percentile confidence intervals
Following common practice (see, e.g., [25]), for a vector f of Fmeasure values achieved by a given prediction procedure on the structures contained in a given set Sof RNAs (here, SSTRAND2), we perform the following steps to determine the 95% confidence interval (CI) for the mean Fmeasure:

(1)
Repeat for 10^{4} times: from f, draw a uniform random sample of size f with replacement, and calculate the average Fmeasure of the sample.

(2)
Report the 2.5t h and 97.5t h percentiles of the distribution of Fmeasures from Step 1 as the lower and upper bounds of the CI, respectively.
The choice of 10^{4} samples in Step 1 follows standard practice for bootstrap CIs (as illustrated by the results shown in Figure S2 in the Supporting Information, the results obtained using different sample sizes are very similar).
Permutation test
Following common practice (see, e.g., [25]), for vectors f_{ A } and f_{ B } of Fmeasure values achieved by given prediction procedures A and B, respectively, on the structures contained in a given set S of RNAs (here, SSTRAND2), we perform the following steps to perform a permutation test for the null hypothesis that the mean Fmeasure achieved by A and Bis the same:

(1)
Repeat for 10^{4} times: For each RNA in S, swap the Fmeasures of A and B with probability 1/2, resulting in vectors {\mathbf{f}}_{A}^{\phantom{\rule{0.3em}{0ex}}\prime} and {\mathbf{f}}_{B}^{\phantom{\rule{0.3em}{0ex}}\prime}, respectively.

(2)
Construct the cumulative distribution function (CDF) of \mathit{\text{avg}}\left({\mathbf{f}}_{A}^{\phantom{\rule{0.3em}{0ex}}\prime}\right)\mathit{\text{avg}}\left({\mathbf{f}}_{B}^{\phantom{\rule{0.3em}{0ex}}\prime}\right) from the 10^{4} permuted pairs of vectors {\mathbf{f}}_{A}^{\phantom{\rule{0.3em}{0ex}}\prime}, {\mathbf{f}}_{B}^{\phantom{\rule{0.3em}{0ex}}\prime} from Step 1, where a v g(·) denotes the average over the elements of a given vector.

(3)
Determine the percentile c of the CDF from Step 2 that is equal to a v g(f_{ A })−a v g(f_{ B }), as determined from the original, unpermuted performance vectors for A and B; p=(100−c)/100 is the pvalue of the test.

(4)
Reject the null hypothesis of equal performance if, and only if, p from Step 3 is smaller than a given significance threshold α.
The choice of 10^{4} repetitions in Step 1 follows standard practice for permutation tests. In this work, we used this test with a standard significance threshold of α=0.05.
AveRNA
As explained earlier, the key idea behind AveRNA is to exploit complementary strengths of a diverse set of prediction algorithms by combining their respective secondary structure predictions for a given RNA sequence. Our AveRNA procedure can make use of an arbritrary set of prediction procedures, A:={A_{1},A_{2},…,A_{ k }}, which it uses in a blackbox manner to obtain predictions for a given input sequence, s. To emphasise the fact that the subsidiary structure prediction procedures in A are effectively just an input to AveRNA that can be varied freely by the user, we use AveRNA (A) to denote AveRNA with set A of component structure prediction procedures.
Applied to input RNA sequence s, AveRNA (A) first runs each A_{ l }∈A on s, resulting in predicted structures S(A_{1},s),S(A_{2},s),…,S(A_{ k },s). Let each of these structures S be represented by a basepairing matrix B P(S) defined by B P(S)_{i,j}=1 if i and j form a basepair in S and B P_{i,j}=0 otherwise, where i,j∈{1,2,…,n}. We note that any RNA secondary structures, pseudoknotted or not, corresponds to exactly one such binary matrix, but not every binary matrix represents a valid secondary structure.
We now consider the normalised sum of these binary matrices:
Each entry B_{i,j} of this matrix can be interpreted as the probability of a base pair between bases i and j in input sequence s, under the assumption that the predictions obtained from each of the A_{ l } should be considered equally likely to be correct. This is equivalent to tallying votes for each possible base pair, where each predictor has one vote per candidate pair i,j.
However, it may well be that some predictors are generally more accurate than others, as is known to be the case for the set of secondary structure predictors we consider in this work. Therefore, we associate a weight (in the form of a real number between 0 and 1) with each predictor and consider the weighted normalised sum of the individual secondary structure matrices:
where w=(w_{1},w_{2},…,w_{ k }), each w_{ l } is the weight assigned to predictor l, and \sum _{i=1}^{k}{w}_{i}=1. We note that the unweighted case from above corresponds to w_{ l }=1/k for each l. Before discussing the interesting question of how to determine appropriate weights, we describe in the following how we infer the pseudoknotfree RNA structure ultimately returned by AveRNA from the entries in the weighted probability matrix P(w).
Structure inference
The final structure prediction returned by AveRNA (A) for a given sequence can be obtained in different ways. First, we note that the problem of extracting a pseudoknotfree structure from the resulting probability matrix can be solved using a Nussinovstyle dynamic programming (DP) algorithm to infer maximum expected accuracy (MEA) structures [6]. We refer to the variant of AveRNA that uses this procedure as AveRN A_{ D P }. Unfortunately, this DP procedure requires Θ(n^{3}) running time, which becomes problematic in the context of the parameter optimisation described later. Therefore, we designed the following greedy algorithm as an alternative way for estimating MEA structures. Let p=(p_{1},p_{2},…) be the sorted list of basepair probabilities in P(w) in decreasing order and V=(v_{1},v_{2},…) be the respective set of basepairs. For a given threshold θ (a parameter of the procedure whose value we discuss later), we begin with an empty set of basepairs S, set i:=1, and repeat as long as p_{ i }≥θ: (1) Add v_{ i } to S if (and only if) it is compatible with all other pairs in S, i.e., does not involve a base already paired with another position or introduce a pseudoknot in S; (2) increment i. We refer to the variant of AveRNA using this greedy inference method as AveRN A_{ Greedy }.
We note that, while the greedy inference method is not guaranteed to find a MEA structure, as we will show later, it performs very well compared to the exact DP inference algorithm and is computationally much more efficient. When either variant of AveRNA is applied to a set of RNA sequences, prediction and structure inference are performed for each given RNA sequence independently, and the results are independent of the composition of the set or the order in which sequences are considered.
Parameter optimization
Clearly, the performance of AveRNA (A) depends on the set A of component prediction procedures as well as on the previously mentioned parameters, namely the weights w_{ l } and, for A v e R N A_{ G r e e d y }, the pairing threshold θ. Before using AveRNA (A) for prediction tasks, we would like to find settings for these parameters that would result in optimised prediction accuracy obtained on a set of reference RNAs (in terms of mean Fmeasure over the set). We solved the resulting numerical optimisation problem using a wellknown procedure called covariance matrix adaptation evolution strategy (CMAES)[26, 27]. CMAES is a nonconvex, gradientfree parameter optimization procedure that has proven to be empirically successful in many realworld applications and appeared to be the most appropriate tool for finding performanceoptimising parameters of AveRNA. We used the MATLAB implementation of CMAES with default settings, except that we had to increase the maximum number of iterations to 100, since in some cases we observed clear evidence that a global optimum was not reached with the lower default setting for this parameter [28].
Time complexity
The running time required to run AveRNA (A) (with a fixed set of parameters) is essentially the sum of the running times of the component prediction procedures A_{1},…,A_{ k } (where we note that in principle, these can be run in parallel and the time required for inferring the output structure from these results). While for AveRN A_{ DP }, the latter time is of order Θ(n^{3}), and therefore no worse than the complexity of most RNA secondary structure prediction methods based on dynamic programming, for AveRN A_{ Greedy }, it is O(n^{2}) in the (unrealistic) worst case and negligible in practice.
Parameter optimisation requires substantially more computational effort, but has to be performed only once, offline, very much like optimisation of the parameters of a given energy model. In the context of AveRNA_{ DP }, each iteration of this optimisation process involves running the Θ(n^{3}) DP procedure on all sequences in the given training set of RNAs, and as we will demonstrate later, it turns out to be important to use reasonably large and diverse training sets. In our experiments, using a training set of 500 sequences, one iteration of CMAES on AveRNA_{ DP } took 653 250 seconds (i.e., more than 750 CPU days for the full optimization). Each iteration of optimising AveRNA_{ Greedy }, on the other hand, took only 2 880 seconds (i.e., the full optimization required less than 4 CPU days). Note that these runtimes do not include the time required by the individual algorithms for predicting the structures, which are the same for both approaches and need to be expended only once at the beginning of the optimisation process. Once the parameters of AveRNA are optimised, it runs efficiently, as described at the beginning of this section.
Ablation analysis
Measuring the contribution of each algorithm to AveRNAs performance can help us answer a wide range of questions, including the following: Which component prediction procedure contributes the most to the overall performance of AveRNA? Is there a certain number of component prediction procedures that must be included before the ensemble method outperforms the individual ones? Are there prediction procedures that can compensate for each other, in the sense that including one procedure from a certain set is important, but adding others from the same set does bring significant further gains? For AveRNA (A) with A={A_{1}, A_{2},...,A_{ k }} we assessed the contribution of each A_{ l }using the following ablation procedure:

(1)
Determine the A_{ l }∈A for which AveRNA (A∖{A_{ l }}) performs worst^{1}, i.e., whose average Fmeasure on the give set of RNAs is lowest.

(2)
Remove A_{ l } from Step 1 from A.

(3)
If A still contains more than two algorithms, go to Step 1 and iterate.
Step 1 involves reoptimising the parameters of AveRNA for each set of component algorithms, starting from the values of AveRNA (A).
Results
In our computational experiments, we pursued two major goals: firstly, to critically assess the state of the art in predicting pseudoknotfree MFE RNA secondary structures, and secondly, to demonstrate that our AveRNA ensemblebased structure prediction method does indeed achieve significantly better results than previous algorithms.
Performance of existing prediction methods
Table 1 shows the the mean Fmeasure value for each method on the SSTRAND2 dataset, along with bootstrap confidence intervals calculated as explained in the previous section, which are also shown graphically in Figure 1. Table 2 shows the results (pvalues) obtained from permutation tests for each pair of methods. As can be seen from this table, the only statistically insignificant performance differences were observed between T99 and CONTRAfold 1.1, and between CONTRAfold 2.0 and NOMCG.
Consistent with previous work [5], we found that the oldest algorithm, T99, achieves a mean Fmeasure just below 0.6. CONTRAfold 1.1 performs slightly better than T99 on our benchmark set, but the performance advantage is not statistically significant; we believe that the reason for this lies primarily in the fact that it was trained on a small set of RNAs not representative of the broad range of structures found in SSTRAND2. MaxExpect and Centroidfold do perform significantly better than T99, but fall short of the performance achieved by CONTRAfold 2.0. The latter method was trained on the SSTRAND2 dataset, which partly explains why it, exactly like NOMCG, achieves an average Fmeasure that is 0.026 higher than that of CONTRAfold 1.1.
The methods recently developed by Andronescu et al., DIMCG, CG^{∗}, BL^{∗} and BLFR^{∗}, each achieve significantly better performance than any of the previously mentioned methods; although the confidence intervals obtained for these methods show some overlap, the respective differences in mean Fmeasure are all significant. The best of these methods, BLFR^{∗}, represents an improvement of more than 0.1 in average Fmeasure over T99, and of almost 0.05 over CONTRAfold 2.0.
Performance correlation
For an ensemblebased approach like AveRNA to work well, the set of component prediction algorithms need to have complementary strengths, as reflected in lessthan perfect correlation of prediction accuracy over sets of RNA sequences. As can be seen in Table 2, the pairwise performance correlation between the procedures we considered in our study is not very strong (as indicated by Spearmann corelation coefficients between 0.66 and 0.86). Figures 2 and 3 illustrate this further by showing the correlation in Fmeasure across our set of RNAs for the two pairs of algorithms whose average performance does not differ significantly, T99 and CONTRAfold 1.1, and CONTRAfold 2.0 and NOMCG, respectively. (In these scatter plots, each data point corresponds to one RNA from our SSTRAND2 set.)
Performance of AveRNA
After optimizing the weights on our training set of RNAs, we found that there was no statistically significant difference between the predictions of A v e R N A_{ D P } and A v e R N A_{ G r e e d y } on the SSTRAND2 set (as determined using a permutation test, which yielded a pvalue of 0.51). Because of its substantially lower runtime requirements, especially during training, we therefore decided to focus on A v e R N A_{ G r e e d y } for the remainder of our study, and we refer to this variant simply as AveRNA.
As can be seen in Table 1, AveRNA achieved an average Fmeasure of 0.716 on SSTRAND2, compared to 0.703 obtained by the best previous method, BLFR^{∗}.
Moreover, even when assessing AveRNA on a test set obtained by excluding the 500 sequences used for parameter optimisation from SSTRAND2, it achieves significantly higher prediction accuracy than any of its constituent algorithms. We note that although this performance improvement might appear to be modest, it is about as much as the difference between BL^{∗} and BLFR^{∗} and, according to a permutation test, statistically highly significant (see Table 3).
To study AveRNA’s performance on sets of RNAs of different types and provenance, we optimised the parameters for AveRNA on subsets of SSTRAND2, from which one of the 7 classes that make up the RNA STRAND database had been excluded, and then tested on the excluded class only, such that there was not only no overlap between training and test set, but also very little similarity. This is a situation where many machine learning techniques are known to perform quite poorly. The results from this experiment, shown in Table 4, indicate clearly that, even in this very challenging setting, AveRNA performs very well: only on 2 of the 7 classes, AveRNA performs significantly worse if trained under exclusion of that class, and in the two remaining cases, the loss in accuracy was only about 2% (Additional file 1: Table S1 for detailed results from the respective permutation tests).
We further note that, as per the results shown in Table 4, prior to AveRNA, the best energybased prediction algorithm varied between RNA classes. On the other hand, AveRNA was found to not perform significantly worse than the previous best method on any of the 7 classes, and in 2 of them (CRW and RFA  see Additional file 1: Table S1), it performed significantly better. This suggests (but of course cannot guarantee) that AveRNA is likely to perform at least as well as other general purpose energybased secondary structure prediction algorithms on previously unseen classes of RNAs. Furthermore, we also optimised AveRNA on a small part of each of the 7 classes and then evaluated it on the entire class; the results of this experiment, also shown in Table 4, indicate that by training a generic version on the broader set of sequences described earlier gives surprisingly good and robust performance – only for 3 of the 7 classes (ASE, SPR, and SRP) the respective classspecific version of AveRNA performs significantly better and in one class (PDB) it performs worst. Table 4 also shows the mean sequence length for every class of RNAs and provides clear evidence that AveRNA’s performance relative to its constituent algorithms does not deteriorate with increasing sequence length.
One interesting property of AveRNA (A) is that the tradeoff between sensitivity and PPV can be easily and intuitively controlled by the threshold θ∈ [ 0,1]: For high θ, only base pairs are predicted for which there is high agreement between the procedures in A, and therefore, we expect relatively few false positive predictions at the cost of relatively many false negatives, while for low θ, even base pairs predicted by very few procedures in A tend to be included in the overall prediction, leading to relatively many false positive, but few false negatives. CONTRAfold 1.1, CONTRAfold 2.0, Centroidfold and MaxExpect also afford control of this tradeoff, via the parameter γ∈ [−5,6], but in a less intuitive manner.
Figure 4 illustrates the tradeoff between sensitivity and PPV for all of these algorithms and shows clearly that overall, AveRNA dominates all previous methods, and in particular, gives much better results than the previous best algorithm that afforded control over this tradeoff, CONTRAfold 2.0. We note that, in all cases, as a procedure becomes increasingly more conservative in predicting base pairs, eventually, both sensitivity and PPV drop (see Additional file 1: Figure S1); we believe this to be a result of the high detrimental impact of even a small number of mispredicted base pairs when overall very few pairs are predicted.
Ablation analysis
The results of the ablation analysis we conducted to study the relative impact of the various component prediction procedures in A on the performance of AveRNA (A) are shown in Table 5. The top 11 rows contain the weights assigned to each algorithm; cases in which a procedure from A was dropped during the optimisation process are indicated by a value of zero. The bottom three rows show the value of threshold θ and the average performance on the training and test sets, respectively.
It is interesting to note that although BLFR^{∗} has a weight of over 40% in the full ensemble, excluding it leads to a rather modest drop of only 0.011 in average Fmeasure, and this drop in performance is the highest caused by removing any single procedure from the full set A. Similarly, the decreases in performance as additional procedure are removed, are mostly quite small. This indicates that, within the set of prediction procedures we considered here, there is not only sufficient complementarity in the strength of individual procedures to obtain benefits from the ensemblebased approach, but also enough similarity in strength between some of the procedures to permit compensating for the removal of one by increasing the weight of others.
As seen in Table 5, up to the point where only one procedure is left in A, the performance of AveRNA (A) is always higher than that of any of its constituents, indicating the efficacy and robustness of our ensemblebased prediction approach.
Training set selection
Clearly, AveRNA’s performance depends on the training set that is used as a basis for optimising its weight parameters. To study the effect of training set size on performance (in terms of mean Fmeasure), we generated 11 training sets of size 100 and 200, as well as one training set of size 500 and one set of size 1000. We then optimised AveRNA (A) for each of these sets and measured the performance obtained on the full SSTRAND2 test set. As can be seen from the results of this experiment shown in the Table 6, decreasing the training set size from 500 to 200 lead to a modest drop in mean Fmeasure by 0.004, and further decrease to 100 caused a larger drop by 0.007. On the other hand, increasing the size of the training set from 500 to 1000 merely resulted in a very small performance improvement of less than 0.001. This indicates that, while it is important to use a reasonably large and diverse training set, at least for the set of prediction procedures considered here, there is only very limited value in using sets larger than that of size 500 we used for all other experiments.
We note that we did not use the training set developed by Andronuescu et al. (2010) in the context of energy parameter estimation, primarily because many of the prediction procedures we study here have been optimised on that set (which could have biased AveRNA to assign higher weights to those algorithms and lead to poor generalization to test data). We also note that all training sets we considered were obtained by random uniform sampling from the full SSTRAND2 set.
Additionally, in Table 2 we have reported the Fmeasures of testset2, a new testset which consists of all members of SSTRAND2 which have not been used by AveRNA or any of the individual algorithms for training purposes. Permutation tests on this new test set (Table S2) confirm that AveRNA remains significantly more accurate than the other algorithms.
Discussion
To no small extent, our work presented here was motivated by the observation that in many cases, the differences in accuracy achieved by RNA secondary structure prediction methods are quite small on average, but tend to vary very significantly between individual RNAs [5, 6]. While this is not surprising, it suggests that care should be taken when assessing different prediction methods to ensure statistically meaningful results, and that potentially, benefits could be derived from combining predictions obtained from different methods. The statistical procedures we use in this work make it possible to assess statistical significance in a wellestablished, quantitative and yet computationally affordable way, and our AveRNA procedure provides a practical way for realising the benefits inherent in a set of complementary prediction methods.
Our results demonstrate that there has, indeed, been steady progress in the prediction accuracy obtained from energybased RNA secondary structure prediction methods. The fact that CONTRAfold 1.1 provides no statistically significant improvement in accuracy over the standard T99 energy model when both are evaluated on our large and diverse set of reference structures needs to be viewed in light of the fact that CONTRAfold 1.1 was trained on a limited set of RNA structures from the RFam database. The fact that CONTRAfold 2.0, which was trained on the the same larger and richer set used by Andronescu et al.[4], performs much better further highlights the importance of the training set used as a basis for empirically optimising the performance of prediction methods. It is interesting to observe that the performance difference between CONTRAfold 2.0 and NOMCG, which are trained on the same set of references structures, are insignificant, which indicates that both methods are equally effective in making use of the information inherent in this set. However, NOMCG, thanks to its additional use of thermodynamic data, produces a physically plausible energy model, while the probabilistic model underlying CONTRAfold 2.0 does not produce realistic free energy values.
We further interpret the fact that DIMCG, CG^{∗}, BL ^{∗} and BLFR^{∗} all perform significantly better than CONTRAfold 2.0 as evidence that the thermodynamic data used by the former methods can effectively inform methods for optimising prediction accuracy based on data. Our statistical analysis provides further support for the claim that the computationally more expensive Boltzmann Likelihood parameter estimation method leads to better results than the Constraint Generation method, and that the additional use of probabilistic feature relationships enables further significant improvements [5].
The accuracy results we obtained for the MaxExpect procedure [6] and for Centroidfold[7] are markedly lower than those reported in the respective original studies, mainly because our evaluation is based on a more extensive set of reference structures. However, we note that the underlying approaches of maximizing expected basepair accuracy and γ−centroid estimators can in principle be applied to any prediction method that produces probability distributions over the secondary structures of a given sequence. We therefore expect that these ideas can eventually be used in combination with parameter estimation methods, such as the ones that gave rise to the CG ^{∗}, BL^{∗} and BLFR^{∗} parameter sets.
The results of our correlation analysis revealed that prediction methods whose accuracy over the entire benchmark set does not differ much (such as T99 and CONTRAfold 1.1) show large differences in accuracy on many individual RNAs. Consistent with earlier observations that predictions that are slightly suboptimal according to a given energy model can sometimes be much more accurate (see, e.g., [6]), we conjecture that this is a consequence of systematic weaknesses (such as the lack of accounting for interactions between nonneighbouring bases or the use of an overly simplistic energy model for multiloops) and inaccuracies (for example, in thermodynamic measurements) in the energy models underlying these procedures. Particularly when using automated methods for optimising the parmaters of a given energy models, such weaknesses and inaccuracies could easily lead to multiple solutions that show similar performance on average, but give very different results on many individual RNAs.
This situation, while at the first glance somewhat unsatisfactory, provides the basis for our AveRNA approach, which obtains more accurate predictions by means of weighted combination of the predictions obtained from a set of given prediction procedures. While our study is focussed on the prediction of pseudoknotfree MFE structures, we note that the weighted sum calculation performed by AveRNA on base pairing matrices naturally extends to methods that produce base pairing probabilities and to pseudoknotted prediction methods. In the latter case, the calculation of the weighted probability matrix P(w) proceeds exactly as in the pseudoknotfree case, but the procedure used for structure inference would have to be modified to produce pseudoknotted MEA structures. In the former case, probability matrices are used instead of Boolean matrices, and the result of the calculation would be normalised to yield a wellformed base pairing probability matrix. (We note that, in light of very recent empirical results based on the statistical approach first developed in the context of the work presented here, it is not clear that MEA structures determined from individual base pairing probability matrices are generally more accurate than MFE structures for the same energy model [29]; however, it is possible that higher accuracies can be obtained via ensemblebased MEA predictions from weighted combinations of multiple base pairing matrices.) We pursued neither of these directions here, because currently, the number of highaccuracy prediction procedures for pseudoknotted RNA structures of basepair probabilities is more limited and because the development and assessment of extensions of AveRNA to those cases pose challenges that are beyond the scope of this work, but we strongly believe that these directions are very promising and should be explored further in the future.
We note, however, that AveRNA as presented here already realises an advantage usually found only in approaches that produce base pairing probabilities: an easy and intuitive way for assessing the confidence with which certain bases are predicted to pair or remain unpaired, by means of inspecting the entries of the probability matrix P(w). Values close to one indicate base pairs that are predicted consistently by many of the underlying prediction procedures, and values close to zero indicate bases that are consistently predicted to be unpaired. Intermediate values indicate base pairings for which there is more disagreement between the given prediction procedures. From the fact that by thresholding these values, the sensitivity and specificity (PPV) for predicting base pairs can be increased quite substantially (as seen in Figure 3), we conclude that the set of prediction procedure used by AveRNA in this work is sufficiently diverse to allow for this interpretation. The threshold parameter θ controls the tradeoff between sensitivity and PPV in an intuitive way. It is conceivable that even higher sensitivity and PPV values can be obtained by optimising the weight parameters of AveRNA specifically for that purpose (something we did not attempt in this work).
Conclusions
The ensemblebased RNA secondary structure prediction method AveRNA introduced in this work not only improves over existings stateoftheart energybased methods, but also holds much promise for the future. AveRNA can make use of arbitrary secondary structure prediction procedures; in particular, as demonstrated here, it can be used to combine both MEA and MFE structures. We expect that by adding new prediction procedures to the set used by AveRNA, even better ensemblebased predictions can be obtained. It is conceivable that eventually, a prediction procedure becomes available that dominates all previous methods, in the sense that it provides predictions as least as accurate as these on all RNAs of interest, and in that case, the ensemblebased prediction approach of AveRNA would not realise any additional gains. Based on our assessment of existing methods, and considering the weaknesses and inaccuracies known to exist in all current energy models, we do not expect this situation to arise in the foreseeable future. The results of our ablation analysis further supports the view that further increases in prediction accuracy achieved by the ensemblebased prediction approach underlying AveRNA are likely to arise as new prediction procedures become available, since – as seen in Table 5 – that was the case when adding new procedures to sets of previously known procedures in the past.
In fact, BLFR^{∗} was introduced when AveRNA was under development and achieved an Fmeasure of higher than the version of AveRNA available at that time. Including BLFR^{∗} in AveRNA produced the version of AveRNA studied here, which – as expected – performs significantly better than BLFR^{∗}. This suggests that AveRNA not only represents the state of the art in secondary structure prediction at the time of this writing, but is likely to remain so, as improved prediction algorithms and energy models are developed and added to the generic ensemblebased approach.
It should be noted, however, that in cases where additional information about the specific secondary structure of a particular RNA is available (e.g., in the form of SHAPE or other footprinting data), prediction methods that utilise this information should be expected to achieve higher accuracies (see, e.g., [30]).
We see several avenues for future work: Here, we focused on pseudoknot free structures, but the general framework (except the dynamic programming) can be applied to pseudoknotted structures as well once a wider range of these algorithms are developed. Similarly, our framework can be applied to algorithms that are able to calculate basepair probabilities (e.g., based on partition functions) or to algorithms that are able to predict several suboptimal structures. New algorithms (e.g., nonenergybased methods) or different configurations of the existing algorithms (using different training strategies) can be included in AveRNA. We showed that the correlation between the predictions of different algorithms is not very strong. These algorithms can be studied to identify their strengths and weaknesses to provide guidance to the endusers. Alternatively, this information could be used to design an instancebased selection algorithm that instead of combining the predictions of all of the algorithms, either selects the most suitable algorithm for each sequence or selects a number of candidates for AveRNA to combine.
References
Mathews D, Sabina J, Zuker M, Turner D: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure1. J Mol Biol. 1999, 288 (5): 911940. 10.1006/jmbi.1999.2700.
Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981, 9: 13310.1093/nar/9.1.133.
Do C, Woods D, Batzoglou S: CONTRAfold: RNA secondary structure prediction without physicsbased models. Bioinformatics. 2006, 22 (14): e9010.1093/bioinformatics/btl246.
Andronescu M, Condon A, Hoos H, Mathews D, Murphy K: Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics. 2007, 23 (13): i1910.1093/bioinformatics/btm223.
Andronescu M, Condon A, Hoos HH, Mathews DH, Murphy KP: Computational approaches for RNA energy parameter estimation. RNA. 2010, 16: 23042318. 10.1261/rna.1950510.
Lu Z, Gloor J, Mathews D: Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA. 2009, 15 (10): 180510.1261/rna.1643609.
Hamada M, Kiryu H, Sato K, Mituyama T, Asai K: Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics. 2009, 25 (4): 46510.1093/bioinformatics/btn601.
Mathews DH: Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. Rna. 2004, 10 (8): 11781190. 10.1261/rna.7650904.
Quinlan J: Bagging, boosting, and C4. 5. Proceedings of the 13th National Conference on Artificial Intelligence. (1996), AAAI Press, 725730.
Rogic S, Ouellette B, Mackworth A: Improving gene recognition accuracy by combining predictions from two genefinding programs. Bioinformatics. 2002, 18 (8): 103410.1093/bioinformatics/18.8.1034.
Asur S, Ucar D, Parthasarathy S: An ensemble framework for clustering protein–protein interaction networks. Bioinformatics. 2007, 23 (13): i2910.1093/bioinformatics/btm212.
Avogadri R, Valentini G: Fuzzy ensemble clustering based on random projections for DNA microarray data analysis. Artif Intell Med. 2009, 45 (2–3): 173183.
Aghaeepour N, Finak G, Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH etal: Critical assessment of automated flow cytometry data analysis techniques. Nature Methods. 2013, 10 (3): 228238. 10.1038/nmeth.2365.
Andronescu M, Bereg V, Hoos H, Condon A: RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics. 2008, 9: 34010.1186/147121059340.
Zwieb C, Gorodkin J, Knudsen B, Burks J, Wower J: tmRDB (tmRNA database). Nucleic Acids Res. 2003, 31: 446447. 10.1093/nar/gkg019.
Sprinzl M, Vassilenko K: Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 2005, 33 (suppl 1): D139D140.
Cannone J, Subramanian S, Schnare M, Collett J, D’Souza L, Du Y, Feng B, Lin N, Madabusi L, Müller K etal: The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics. 2002, 3: 210.1186/1471210532.
Brown J: The ribonuclease P database. Nucleic Acids Res. 1999, 27: 314314. 10.1093/nar/27.1.314.
Andersen E, Rosenblad M, Larsen N, Westergaard J, Burks J, Wower I, Wower J, Gorodkin J, Samuelsson T, Zwieb C: The tmRDB and SRPDB resources. Nucleic Acids Res. 2006, 34 (suppl 1): D163D168.
GriffithsJones S, Moxon S, Marshall M, Khanna A, Eddy S, Bateman A: Rfam: annotating noncoding RNAs in complete genomes. Nucleic Acids Res. 2005, 33 (suppl 1): D121D124.
Andronescu M: Algorithms for predicting the secondary structure of pairs and combinatorial sets of nucleic acid strands. Vancouver, British Columbia, Canada, The University of British Columbia 2003
Andronescu M: Computational approaches for RNA energy parameter estimation. Vancouver, British Columbia, Canada, The University of British Columbia 2008
Aksay C, Salari R, Karakoc E, Alkan C, Sahinalp S: taveRNA: a web suite for RNA algorithms and applications. Nucleic Acids Res. 2007, 35 (suppl 2): W325
Tsang H, Wiese K: SARNAPredict: A study of RNA secondary structure prediction using different annealing schedules. IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology (CIBCB’07). 2007, IEEE, 239246.
Hesterberg T, Moore D, Monaghan S, Clipson A, Epstein R: Bootstrap methods and permutation tests. Introduction Pract Stat. 2005, 47 (4): 170.
Hansen N, Ostermeier A: Completely derandomized selfadaptation in evolution strategies. Evol Comput. 2001, 9 (2): 159195. 10.1162/106365601750190398.
Hansen N, Auger A, Ros R, Finck S, Pošík P: Comparing results of 31 algorithms from the blackbox optimization benchmarking BBOB2009. Proceedings of the 12th Annual Conference Comp on Genetic and Evolutionary Computation. 2010, ACM, 16891696.
Hansen N, Kern S: Evaluating the CMA evolution strategy on multimodal test functions. Parallel Problem Solving from NaturePPSN VIII. 2004, Springer, 282291.
Hajiaghayi M, Condon A, Hoos H: Analysis of energybased algorithms for RNA secondary structure prediction. BMC Bioinformatics. 2012, 13: 2210.1186/147121051322.
Deigan K, Li T, Mathews D, Weeks K: Accurate SHAPEdirected RNA structure determination. Proc Natl Acad Sci. 2009, 106: 9710.1073/pnas.0806929106.
Acknowledgements
We thank Anne Condon and Mirela Andronescu for their insightful comments on this work, and Dave Brent for help with setting up the web server for the AveRNA software. This work was supported by a MSFHR/CIHR scholarship to NA, a University of British Columbia’s graduate fellowship to NA, and by an NSERC discovery grant held by HH.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
Both authors declare that they have no competing interests.
Authors’ contributions
HH conceived the original idea. NA and HH designed the methodology, conceived the experiments, interpreted the results, and wrote the manuscript. NA implemented the methodology and performed the experiments. All authors read and approved the final manuscript.
Electronic supplementary material
12859_2012_6014_MOESM1_ESM.pdf
Additional file 1: Supplemental Information. A PDF file with supplementary figures and tables as described in the main text. (PDF 71 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Aghaeepour, N., Hoos, H.H. Ensemblebased prediction of RNA secondary structures. BMC Bioinformatics 14, 139 (2013). https://doi.org/10.1186/1471210514139
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1471210514139