Introduction

The increasing discoveries of noncoding RNAs demand more than ever the information about RNA structure (Bachellerie et al. 2002; Kertesz et al. 2007; He et al. 2008; Bartel 2009; Gong and Maquat 2011; Wang et al. 2013). For example, the 3D structure of a microRNA–target complex is crucial for understanding microRNA’s binding affinity and efficacy in gene regulation (Kertesz et al. 2007; Bartel 2009). However, the time-consuming, laborious, expensive experimental determination, such as X-ray crystallographic and NMR spectroscopic measurements, alone cannot catch up the pace with the rapidly increasing number of biologically significant RNAs such as noncoding regulatory RNAs. This problem highlights the need for computational prediction of RNA folding.

The structure of an RNA is determined by the complex pattern of base–base interactions, including base-paired secondary structures and long-range tertiary interactions. Existing RNA folding theories mainly focus on the secondary structures. However, knowing the secondary structure information alone is not sufficient to determine the 3D structure, because a 3D structure often involves long-range tertiary interactions such as kissing interactions between the different loops. Therefore, for a physics-based approach, accurate evaluation of the energetic parameters for tertiary interactions is critical for 3D structure prediction. Moreover, RNA function is correlated not only to the minimum free energy state of an RNA, but also to the potentially large conformational changes it can undergo. Understanding RNA function requires the understanding of the full energy landscape.

Statistical mechanics-based modeling has led to significant success in RNA structure prediction, folding stabilities, and folding kinetics for structures with different structural complexities (Liu et al. 2006; Chen 2008). For example, a recently developed statistical mechanics-based RNA folding model, “Vfold” model, has provided a wide range of quantitative predictions and novel insights for a variety of experiments and RNA functions, such as the pseudoknot-involved conformational switch between bistable secondary structures (Xu and Chen 2012), microRNA gene regulation through microRNA/mRNA-binding interactions (Cao and Chen 2012), and RNA/RNA dimerization critical for viral replication (Cao and Chen 2011; Cao et al. 2014). However, despite the success of this approach, several key issues remain. Estimation of the entropies for RNA tertiary folds and extraction of the energy parameters for noncanonical tertiary interactions from thermodynamic data or known structures present major challenges hampering the structural modeling for large and complex RNAs. The primary focus of this article is on the statistical mechanics-based methods for predicting RNA 3D structures and folding energy landscapes, and the related quantitative insights into RNA functions.

An overview of computational methods for RNA folding

RNA folding process is believed to be partly hierarchical, whereby secondary structural motifs fold first followed by the tertiary contacts formation. The secondary structure is a set of helices containing canonical base pairs (A–U, G–C, and G–U) and contributes to the major part of the free energy of an RNA system. Canonical base pairing and base stacking within helices are generally stronger than the non-canonical interactions in loop parts of an RNA system. Therefore, many computational models dissect the RNA folding problem into two steps: from sequence to two-dimensional (2D) structure and from 2D structure to three-dimensional (3D) structure, where a 2D structure is defined by base pairs including tertiary cross-linked base pairs such as kissing base pairs. With the 2D structure as constraint, the accuracy of 3D structure prediction can be significantly improved.

2D structure predictions

Computational models for RNA 2D structure prediction fall into two general categories: free energy minimization (Ding and Lawrence 2003; Hofacker 2003; Zuker 2003; Xayaphoummine et al. 2005; Mathews and Turner 2006; Parisien and Major 2008; Bellaousov et al. 2013; Xu et al. 2014), and sequence comparison (Gutell et al. 2002; Hofacker et al. 2002; Mathews and Turner 2002; Havgaard et al. 2005; Bindewald and Shapiro 2006; Bernhart et al. 2008; Sato et al. 2009). Most free energy minimization methods employ the empirical thermodynamic parameters [the Turner parameters (Turner and Mathews 2010)] for the different secondary structural elements. Other models, such as MC-Fold (Parisien and Major 2008), use knowledge-based energy functions extracted from the known PDB structures. However, not all the interactions (such as long-range tertiary contacts) can be captured by these parameters. As a result, the accuracy of prediction falls off rapidly with the length of the sequence, because larger RNAs are more prone to the formation of long-range tertiary contacts.

The accuracy of computational predictions is usually better for methods that consider “fold recognition”: structure is usually more conserved than sequence, and the functional core regions are usually more conserved at all levels. In general, sequence comparison-based methods can give more improved predictions than free-energy-based predictions if the homologous sequences are available. However, these methods depend strongly on the availability of the sequence database. To overcome the above limitations, several hybrid algorithms that combine free energy minimization and sequence comparison have been developed (Mathews and Turner 2002; Havgaard et al. 2005; Bernhart et al. 2008). For example, Dynalign (Mathews and Turner 2002) combines free energy minimization and comparative sequence analysis to find a low free energy structure common to two sequences without requiring any sequence identity. On average, Dynalign predicted 86.1 % of known base-pairs in the tRNAs, compared to 59.7 % by free energy minimization alone. For the 5S rRNAs, the average accuracy improves from 47.8 to 86.4 %.

Another way to improve the accuracy of structure prediction is to incorporate experimental data to the secondary structure prediction modeling (Mathews et al. 2004; Deigan et al. 2009; Low and Weeks 2010; Kladwang et al. 2011; Hajdin et al. 2013; Leonard et al. 2013). Selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE) probing data has proved useful for RNA secondary structure modeling (Deigan et al. 2009; Low and Weeks 2010; Kladwang et al. 2011; Hajdin et al. 2013; Leonard et al. 2013). The SHAPE information provides refinements for the experimental determined thermodynamic parameters (Turner and Mathews 2010) for RNA folding. Benchmark test for a set of 21 RNAs of size from 34 to 530 nt shows that 93 % on average of known base pairs can be predicted. and all pseudoknots in well-folded RNAs can be identified (Hajdin et al. 2013).

Tertiary motifs: free energy models

Unlike the entropy (free-energy) parameters for simple loops (hairpin, bulge, and internal loops), which have been determined from thermodynamic experiments (Turner and Mathews 2010). Quantitative understanding of many other interactions remains very limited. Moreover, because of the possible conformational coupling between the different loops and between loops and helices, the loop entropies are not additive for tertiary motifs such as loop–loop kissing contacts (Fig. 1). Previous studies on the kissing complexes and other RNA folding systems such as pseudoknots suggested that a reliable estimation for the entropy is indispensable for folding predictions (Cao and Chen 2006, 2009; Andronescu et al. 2010a, b). Accurate treatment for the entropy and free energy for tertiary structure formation is a bottleneck.

Fig. 1
figure 1

A A schematic figure for the microRNA–target-binding complex. The entropic change upon the binding between microRNA and the mRNA ΔS(l eff, S) depends on the length of the binding site S and the effect loop length l eff. B A schematic diagram and all-atom structure of the hairpin–hairpin kissing complex

Dirks and Pierce (2003) introduced a simplified energy model for H-type pseudoknots:

$$ G_{\text{pseudo}} = \beta_{1} + \beta_{2} B^{\text{P}} + \beta_{3} U^{\text{p}} , $$

where β 1 is the penalty for introducing a pseudoknot, B p is the number of base pairs that border the interior of the pseudoknot, and U p is the number of unpaired bases inside the pseudoknot. Later, this energy model was extended (Sperschneider et al. 2011) to parameterize hairpin–hairpin kissing motifs, as shown in Fig. 1. In essence, by decoupling the interplay between helices and loops in the tertiary motifs, this energy model approximates the non-additive energy with an additive model.

Based on the polymer physics theory (de Cloizeaux 1974; Grosberg and Khokhlov 1994), Aalberts et al. (2013) proposed the following expressions (Meng and Aalberts 2013) for the free energy cost of stretching mRNA hairpin loops upon small RNA binding (Fig. 1):

$$ \beta G\left( {N,M,z} \right) = \frac{3}{2}\ln \left( {N^{\frac{6}{5}} a^{2} + M^{\frac{6}{5}} b^{2} } \right) + \beta C_{0} - \frac{5}{18}\ln \left( {\frac{z - 2.4}{{R_{\text{F}} - 2.4}}} \right) - \left( {N + M - 2} \right)\ln \left( {\frac{Na + Mb - z}{{Na + Mb - R_{\text{F}} }}} \right) + \left( {\frac{5}{{2R_{\text{F}} }} + \frac{5}{{18\left( {R_{\text{F}} - 2.4} \right)}} - \frac{N + M - 2}{{Na + Mb - R_{\text{F}} }}} \right)(z - R_{\text{F}} ), $$

where N is the number of single-stranded backbone segments (of length a = 6.2 Å), and M is the number of helix crossing segments (of length b = 15 Å). The Flory radius \( R_{\text{F}} = \left( {N^{\frac{6}{5}} a^{2} + M^{\frac{6}{5}} b^{2} } \right)^{1/2} \) represents the characteristic end-to-end separation of a self-avoiding chain (Aalberts and Hodas 2010). β = k B T and the constant C 0 can be set on the basis of experiment. The parameter z is the end-to-end separation of a helix, which can be calculated as

$$ z(s) = \left\{ {(hs)^{2} + r^{2} \left[ {1 - \cos \left( {\frac{2\pi s}{11}} \right)} \right]^{2} + r^{2} { \sin }^{2} \left( {\frac{2\pi s}{11}} \right)} \right\}^{1/2} = \left[ {(hs)^{2} + 4r^{2} { \sin }^{2} \left( {\frac{2\pi s}{11}} \right)} \right]^{1/2} $$

for an A-form RNA helix of (s + 1) base pairs, with h = 2.7 Å and r = 9.9 Å. We note that the freely jointed chain (FJC) model used to derive the above free energy upon small RNA binding to mRNA hairpin loops does not consider the excluded volume effect between the A-form helix and the single-stranded chain; moreover, the FJC model can only provide an estimation for long chains.

To compute entropy of hairpin, bulge/internal and multibranch loops of long length (up to 50 nt), with an efficient sampling method based on the sequential Monte Carlo principle, Zhang et al. (2008) developed optimized discrete k-state models based on RNA backbone conformations in known RNA structures. The method is general and can be applied to calculating entropy of loops with high complexity.

3D structure predictions

RNA 3D structure prediction is still at its early stage (Shapiro et al. 2007; Andersen 2010; Laing and Schlick 2011; Rother et al. 2011; Sim et al. 2012). Current RNA folding algorithms are generally limited to simple (short) structures, hampered by the challenges including adequate treatment of the conformational sampling and the evaluation of the energies for the tertiary contacts. Table 1 describes some of the most recently developed algorithms, ranging from coarse-grained modeling to various structure–assembly, and other conformational sampling approaches.

Table 1 A partial list for the computational models for RNA tertiary structure prediction and interactive manipulation

Coarse-grained approaches

Coarse-grained representation can largely reduce the degrees of freedom and thus enhance the conformational sampling. YAMMP/YUP (Wang et al. 1999; Tan et al. 2006) and NAST (Jonikas et al. 2009) represent RNA with just one pseudo-atom per nucleotide residue: P and C3′, respectively. iFoldRNA (Sharma et al. 2008) and Vfold (Cao and Chen 2005; Shi et al. 2014) represent RNA by three pseudo-atoms per residue. Ren (Xia et al. 2010, 2013) uses 5-bead to represent each nucleotide, and HiRE-RNA (Pasquali and Derreumaux 2010) uses six or seven pseudo-atoms for purine and pyrimidine residues, respectively. Coarse-grained systems are usually modeled with knowledge-based potentials that are derived from known structures. Combined with discrete molecular dynamics (DMD) (Ding et al. 2008) or other similar methods, this approach has the potential to predict structures and folding mechanism for large RNAs. For example, a recently developed 3-bead model (Shi et al. 2014) can achieve 3.5 Å RMSD on average for 46 small RNAs including pseudoknots. Combined with Monte Carlo-simulated annealing algorithm and a coarse-grained force field with implicit salt, the model may provide reliable predictions for the stability and salt effect with the mean deviation ∼1.0 °C of melting temperatures, compared with the extensive experimental data for 30 RNA hairpins.

Another coarse-grained approach is the graph theory-based tool (RAG) (Izzo et al. 2011; Kim et al. 2014) for sampling RNA global helical topologies. RAG represents RNA 2D structure as planar tree or dual graphs to assist the cataloging, analyzing, and designing of RNA structures. With the knowledge-based potential for internal loop orientations, such as bending and torsion of internal loops, the combination of graph theory and Monte Carlo-simulated annealing sampling shows great promise for assembling global features of RNA architecture: graph RMSDs range from 2.52 to 28.24 Å for RNAs of the size 25–158 nucleotides.

Structure–assembly approaches

Based on the assumption that the 3D fold is more conserved and can be recognized by the alignment of sequences and secondary structure patterns, the template-based modeling (Das and Baker 2007; Parisien and Major 2008; Das et al. 2010; Cao and Chen 2011; Zhao et al. 2012) has shown promising achievements in RNA 3D structure predictions. In structure-assembly approaches, RNA 3D structures are built based on the known structures modules ranging from fragments of 1–3 nucleotides to 2D structural motifs.

FARNA/FARFAR (Das and Baker 2007; Das et al. 2010) models RNA 3D structures by assembling of short fragments (1–3 nucleotides) from a single crystal structure via a Monte Carlo procedure guided by a knowledge-based energy function that encodes base-stacking and base-pairing potentials. It can reach atomic resolution (<3.0 Å) for most short RNAs (<30 nt). MC-Sym (Parisien and Major 2008) builds all-atom structures using the 3D version of the nucleotide cyclic motif (NCM) fragments. The 3D NCM library was built from a list of 531 known RNA 3D structures. Due to the limited NCM fragments for large, complex NCM motifs, such as 6-way junctions and kissing loops, current MC-Sym is limited to short RNAs requiring 2D structures as input. 3dRNA (Zhao et al. 2012) builds the whole RNA structure from the smallest secondary elements (SSEs) by a two-step assembling procedure. The SSEs are defined as base-pair hairpin, internal/bulge loop, pseudoknot loop and junction, which are extracted from known structures.

One of the common limitations to the structure–assembly approaches is the degree of divergence of the fragment library. Given the limited number of known RNA structures, structural motif templates with the required high sequence identity are difficult to attain. The lack of reliable structural motifs for loops and junctions greatly hampers accurate 3D RNA structure prediction. Moreover, the template-based structure prediction models cannot predict structures with “new” motifs.

As for the coarse-grained models, incorporating experimental data can dramatically improve the accuracy for the structure–assembly approaches. For instance, constraints using structural inference of native RNAs by high throughput contact mapping, such as the multiplexed hydroxyl radical (–OH) cleavage analysis (MOHCA), improve the FARNA’s prediction (Das et al. 2008). For the 158-nt P4–P6 domain of the group I intron, MOHCA leads to an improvement of RMSD from 35 Å with FARNA to 13 Å.

Sampling algorithms

One of the challenges for current RNA structure prediction is the problem of conformational sampling. Even for the DMD with knowledge-based energy functions at different coarse-grained levels, a major issue is that sampled conformations often remain close to the initial starting model (Sim et al. 2012). The molecular system is trapped in its local energy minima for the most part of the computational time, and the barriers between local minima on the energy landscape hinder transitions between different low-energy states. To overcome this difficulty requires the use of special simulation techniques (Li and Scheraga 1987; Rahman and Tully 2002; Minary et al. 2004; Curuksu and Zacharias 2009) to achieve effective sampling of conformational space.

A probabilistic model, called BARNACLE (Frellsen et al. 2009), allows for efficient sampling of RNA conformations in continuous space and with related probabilities. Using coarse-grained base-pairing information, BARNACLE generates reasonable RNA-like structures for small RNAs (<50 nt). However, the method is mostly limited to short RNAs because of the rapid increase in complexity of the probabilistic model.

Interactive manipulation

Many RNA structure design algorithms, such as RNA2D3D (Martinez et al. 2008) and Assemble (Jossinet et al. 2010), are quite efficient. This interactive graphical tools are useful to analyze and build RNA architectures, but have less ability for RNA structure predictions, since they rely on manual application of expert knowledge.

Vfold: from sequence to 3D all-atom structures

Vfold (web server (Xu et al. 2014): http://rna.physics.missouri.edu) is a model used to predict RNA 2D and 3D structures and the folding stability from the sequence. The model distinguishes itself from other models by two unique features: physics-based modeling of conformational entropy for 2D structure prediction, and template-based multiscale modeling for 3D structure prediction.

Entropy parameters for tertiary motifs

Using the P-C4′ and C4′-P virtual bonds to represent the backbone conformations, the Vfold model (Cao and Chen 2005) samples loops/junction conformations in the 3D space through conformational enumeration (Xu et al. 2014). By calculating the probability of loop formation, the model estimates the conformational entropy parameters for the formation of the different types of loops such as pseudoknot loops and hairpin–hairpin kissing motifs. The model has the advantage of accounting for chain connectivity, excluded volume, and the completeness of conformational ensemble. Studies by us and other groups show that an accurate entropy parameter improves the prediction of RNA secondary structures and thermodynamic stabilities (Andronescu et al. 2010). Here, we use the hairpin–hairpin kissing motif to illustrate the Vfold calculation for the entropy of an RNA/RNA kissing complex.

The hairpin–hairpin kissing complex, shown in Fig. 1B, consists of three stems and four loops. We assume loop l 2 and l 4 are short, with ≤1 nucleotide, which favors the formation of coaxial stacking interaction between stem H 1 and H 2 and between H 2 and H 3. Therefore, the entropic cost upon the formation of loop–loop kissing S(H 2, l 1, l 3) depends on the length of the stem H 2 and the single-stranded loops of l 1 and l 3. The computation involves three steps:

  1. (1)

    Due to the nature of the coaxial stacking between stems of H 1, H 2, and H 3, the relative orientation between stems of H 1 and H 3 is determined by the length of stem H 2. The coordinates of the 8 nt (\( a_{\text{i}} ,a_{i}^{{\prime }} ,a_{j} ,a_{j}^{{\prime }} ,b_{i} ,b_{i}^{{\prime }} ,b_{j} ,b_{j}^{{\prime }} , \) shown in Fig. 1B) are adopted from the known NMR structure as the template. The final coordinates of the 8 nt for different length of H 2 are generated according to the A-formed H 2 and the template.

  2. (2)

    For each helix orientation, with well-defined (a i , a j ) of the starting and ending nucleotides for the loop l 1 and (b i , b j ) of the starting and ending nucleotides for the loop l 3, we model loop conformations as self-avoiding walks of the virtual bonds on diamond lattice (Cao and Chen 2005) to sample loops/junctions 3D conformations. The connection between the A-form helix and the discrete loop conformations is realized through an iterative optimized algorithm (Ferro and Hermans 1971).

  3. (3)

    A key issue in the conformational count is the excluded volume interaction between loop and helix and between the different loops. In the Vfold model, this can be explicitly taken into account by disallowing overlapping virtual bonds when the loop conformations are generated in the virtual bond diamond lattice. Assuming the interactions in the loops are weaker than the base stacking interactions that stabilize a 2D structure, we can estimate the loop entropy parameter as the logarithm of the conformational count.

The Vfold-predicted loop entropies (Table 2) enable folding free energy calculations for RNA/RNA complex such as for microRNA–target-binding (Cao and Chen 2012) and hairpin–hairpin kissing complex systems (Cao and Chen 2011; Cao et al. 2014). For example, for the kissing complex shown in Fig. 2, the free energy is computed as

$$ \Updelta G = \Updelta G_{\text{stem1}} + \Updelta G_{{{\text{stem}}2}} + \Updelta G_{\text{kissing - stem}} - T\Updelta S\left( {S,l_{1} ,l_{2} ,l_{3} ,l_{4} } \right) = - 15.7 - 15.7 - 14.2 + 8.62(k_{\text{B}} T) = - 40.2\;{\text{kcal/mol}}, $$

where S is the number of base pairs in the kissing stem, and l 1, l 2, l 3, l 4 are the length of loops 1, 2, 3, and 4, respectively. The entropic energy of kissing loop is estimated by \( k_{\text{B}} \ln \frac{{\omega_{6,2,2} }}{{\omega_{\text{coil}} }}, \) with \( { \ln }\omega_{6,2,2} \) is read from Table 1, and \( { \ln }\omega_{\text{coil}} \) = 2.05 l + 0.1 (l is the chain length of loop l 1 or l 3) from the polymer physics theory.

Table 2 The Vfold-derived conformational entropies \( { \ln }\omega_{{H_{2} , l_{1} , l_{3} }} \) for the kissing complex for the different stem lengths and different loop lengths
Fig. 2
figure 2

The evaluation of the free energy for a hairpin–hairpin kissing complex using the loop entropy parameters in Table 1 in Cao and Chen (2011) and the Turner parameters (Turner and Mathews 2010)

To treat more complicated (general) loop–loop kissing complexes, as shown in the Fig. 2 of Cao et al. (2014), the stem-loop substructure’s impact on loop entropy is approximated by replacing the terminal base pairs (of the stems) by single nucleotides. Then, the effective loop lengths of l 1, l 2, l 3, and l 4 are equal to the sum of the number of unpaired nucleotides and the number of stem-loop substructures. This approximation ignores the (weak) excluded volume interference between the stem-loop substructure and the loop, thus enabling us to treat general kissing motifs.

Template-based RNA 3D structure prediction

Predicting RNA 3D structure is not a solved problem (Shapiro et al. 2007; Andersen 2010; Laing and Schlick 2011; Rother et al. 2011; Sim et al. 2012). Extensive efforts have been made to enhance the conformational sampling (Das and Baker 2007; Frellsen et al. 2009) and to establish accurate scoring functions for the ranking of the different structures (Ding et al. 2008; Parisien and Major 2008). The Vfold model can predict the 2D structures of RNA/RNA kissing complex including inter- and intramolecular base pairings. In general, a 2D structure can correspond to a large number of 3D structures due to the multiplicity of flexible loop conformations. The Vfold model-predicted virtual bond structure provides a scaffold for the construction of all-atom models of the 3D structure. The prediction of the all-atom 3D structures from a given sequence and 2D structure (base pairs) involves the following three steps (see Fig. 3):

Fig. 3
figure 3

A The 2D structure of the BWYV pseudoknot. Vfold identifies it as a motif of “PK(5-2-1-7-3)”. B The virtual-bond (low-resolution) structure built from the motif-based template library. C The all-atom 3D structure refined by Amber energy minimization

  1. (1)

    To build the 3D virtual bond structure. Helices are modeled as A-form virtual-bonded helix structures. The loop/junction structures are built from the virtual bond fragments of the template structures. To identify the optimal template structure for the loops/junctions, the model screens the pretabulated template library according to the loop size (first) and the sequence (second) matches. If necessary, this step may involve sequence replacement in order to match the (same size) sequences in the template library. The model assembles the helix and loop 3D virtual-bonded structures to construct the 3D scaffold of the whole RNA.

  2. (2)

    To add all atoms to the virtual-bonded structure. For nucleotides in each helix, atoms are added according to the A-form helix atomic structure. The 3D conformation of the nucleotides in loops are generated by adding atoms according to the templates for base configurations, by aligning the C1′, N9, C4, and C8 for purine (A or G) or C1′, N1, C2, and C6 for pyrimidine (C or U) with those of a nucleotide in a helix. This step results in an “atomistic version” of the Vfold structure. Using three atoms instead of one atom per base, the current Vfold can better capture the base orientation from templates and also can easily replace base type accordingly.

  3. (3)

    Energy minimization of the whole atomistic structure using AMBER molecular dynamics simulations. The above pre-refinement structure may contain atoms/groups that clash sterically with each other. Such steric clashes can be readily resolved by the all-atom molecular dynamics simulations. With the above pre-refinement structure as the initial state, the AMBER energy minimization (Case et al. 2005) yields reliable predictions for all-atom 3D structures. In the energy minimization, the negative charges on phosphates are neutralized by Na+ cations added to the solution. The nonbonded interactions are truncated at 12 Å. Water molecules are treated by the standard TIP3P model included in AMBER software. For the most predicted structures, we found that the minimization causes only small change in the RMSD of the structure. The main advantage of the multiscale approach in the Vfold model is that the virtual-bonded tertiary structures as the initial state may already lie in the free energy basin, so the structure refinement can avoid large structural rearrangements and can thus lead to the native structure effectively.

The Vfold model predicts the 3D structure for a 2D structure based on the structural templates. To construct the template library, the model classifies the structure into different motifs, such as helices, hairpin loops, internal/bulge loops, pseudoknots, and N-way junctions (N ≥ 3). The motif-based template library was built from 2621 PDB structures. With the increasing number of known RNA structures, the larger and more divergent pools of the known loop/junction structures with the different types and different lengths would lead to better predictions of the 3D structure.

As shown in Fig. 3, the above strategy gives reliable predictions for the all-atom 3D structures for simple tertiary folds, such as pseudoknots and hairpin–hairpin kissing complexes. The predicted structures as a 3D scaffold will provide highly needed guidance for experiments. For example, the sequential resonance assignments from the Nuclear Overhauser Effect (NOE) data may become efficient and more accurate if the information on the nucleotide spatial proximity from the predicted (low-resolution) structure is combined with for the NMR structural determination of RNA.

Quantitative prediction for the folding of HIV-1 DIS complex

Intermolecular loop–loop base pairing is a widespread and functionally important tertiary structure motif in RNA. Loop–loop interactions often facilitate dimerization reactions between RNA molecules. For example, in HIV-1 virus, the loop–loop kissing interaction is critical for one form of HIV-1 dimerization (Laughrea and Jette 1994; Muriaux et al. 1996; Paillart et al. 2004). In bacteria, loop–loop interaction can regulate gene expression and affect replication and translation of the bacteria (Schmidt et al. 1995; Argaman and Altuvia 2000; Repoila et al. 2003; Bossi and Figueroa-Bossi 2007; Vogel and Wagner 2007). A well-documented case is OxyS RNA repression of fhlA translation in Escherichia Coli through the formation of a stable loop-kissing interaction (Argaman and Altuvia 2000).

The dimerization process is essential for the HIV-1 replication. Muriaux et al. proposed a two-step dimerization process (Muriaux et al. 1996a, b). The kissing loop–loop complex is formed followed by a conversion to form the extended-duplex dimer due to temperature increase or protein binding. Both the kissing-loop dimer and the extended-duplex have been found in the structural measurement (Ennifar et al. 1999, 2001). Due to the lack of the thermodynamic parameters for the kissing-loop dimer, it has been difficult to determine the relative population of each dimer at the different temperature. Also, it would be biologically important to understand if the kissing-loop dimer is a kinetic intermediate or a thermodynamic stable state at room temperature. Vfold model provides a useful tool to quantitatively predict the thermodynamic stabilities for the different dimes by computing the free energy landscape of the two-stranded system (Cao and Chen 2011, 2012; Cao et al. 2014). Recently developed RNA structure prediction models are good at predicting some structures at high-accuracy resolution. For example, de novo predictive models can accurately predict the simple and short hairpin and internal loop structures (Das and Baker 2007; Ding et al. 2008; Parisien and Major 2008). However, the models cannot predict the kissing complex. The Vfold model enables the prediction of kissing complexes (Cao and Chen 2011).

Quantitative prediction of HIV-1 DIS complex requires modeling of the folding energy landscape and the structures of dimers. The partition function for the two-stranded system \( Q(T) = Q_{1} \cdot Q_{2} + e^{{ - \Updelta G_{\text{associate}} /k_{\text{B}} T}} \cdot Q_{12} \) is the sum over the unbound and bound systems. Here, Q 1, Q 2, and Q 12 are the partition functions of the (unbound) strand 1 RNA, of the (unbound) strand 2 RNA and of the kissing (bound) system, respectively. ΔG associate is dependent on the RNA concentration C T: \( \Updelta G_{\text{associate}} = \Updelta G_{\text{init}} - k_{\text{B}} T{ \ln }(C_{\text{T}} /4). \) We choose ΔG init to be 4.1 kcal/mol according to the experimental result (Serra and Turner 1995; Zuker 2003). The calculation of Q 1 and Q 2 for single-stranded RNA can be achieved by many RNA secondary structure prediction models; however, the computation of Q 12 requires a statistical mechanical model such as the Vfold model.

The predicted free energy landscape for HIV-1 Mal shows two free energy minima, indicating two coexisting structures at room temperature, shown in Fig. 4A. The structural (base-pairing probability) calculations show that the free energy minima correspond to the kissing-loop dimer and the extended-duplex dimer, respectively. The extended-duplex dimer is slightly more stable than the kissing-loop dimer, with the free energy difference ΔG < 1.0 kcal/mol. The result suggests that the two modes of dimerization of HIV-1 Mal can coexist in thermodynamic equilibrium and can possibly interconvert with the change of the temperature and solution condition.

Fig. 4
figure 4

A The free energy landscape for the HIV-1 Mal dimer at T = 20 °C. The Vfold model predicts two coexisting structure (I, II), corresponding to the extended-duplex and kissing-loop dimers; respectively. In the energy landscape, N and NN are the numbers of the native and non-native base pairs, respectively. B The Vfold predicts 3D structures (in orange) for the kissing-loop and extended-duplex dimers for HIV-1 Mal dimer. The all-atom RMSDs are 3.1 and 2.9 Å with respect to the experimental structures (in gray) with PDB IDs 1xpe and 462d, respectively

Moreover, based on the Vfold study, we find that the kissing-loop dimer of HIV-1 Mal is stabilized by the coaxial stacking. We built the 3D structure of the kissing-loop dimer according to the above multiscale strategy. The all-atom RMSD between predicted structure and the experiment solved NMR structure (PDB id: 1xpe) is 3.1 Å, as shown in Fig. 4B. For the extended-duplex dimer (structure I on the energy landscape), Vfold predicts the 3D structure with an RMSD of 2.9 Å (PDB id: 462d).

Conclusions

The bottleneck for RNA tertiary structure prediction is the inability to treat the free energy, especially the entropy, for structures with long-range tertiary interactions. The virtual-bond based low-resolution conformational model (Vfold) allows us to estimate the entropy and the full free energy landscape for RNA tertiary global folds. The predicted 2D structures provide scaffolds for the construction of all-atom 3D models through molecular dynamics calculations. Validation by the experimental data for the RNA 2D and 3D structures and the folding thermodynamics as well as kinetics suggests that the statistical mechanics-based approaches can be quite reliable.

By considering the non-canonical interactions at both secondary structure and tertiary structure levels, Vfold model improves the accuracy of the secondary structure prediction and introduces more detailed constraints, besides the canonical base-pairing information, to the 3D structure modeling. However, a fast sampling algorithm balancing the completeness of the conformational space including non-canonical base-pairing information and the computational time is needed to treat large RNAs. Moreover, further development of the model should go beyond the simple hairpin–hairpin kissing complexes by estimating the entropic parameters for global folds with more complicate tertiary interactions, such as the SAM riboswitch.

With the rapidly growing size of the database of the experimentally measured RNA structures, motif template-based methods shows increasingly promising results, especially when the homologous conformations can be identified from the known structures. However, a backup plan is always needed for a good model if no known homologous conformations can be found in the PDB database. For instance, there is only a few of hairpin–hairpin kissing motifs in the current motif library. Further development of the model should address the motifs involving tertiary interactions with the ability of de novo construction.