The distance distribution within an RNA molecule is of interest in various contexts. Most directly, the question arises whether panhandle-like structures (in which 3’ and 5’ ends of long RNA molecules are placed in close proximity) are the rule or an exception. Panhandles have been reported in particular for many RNA virus genomes. Several studies [14] agree based on different models that the two ends of single-stranded RNA molecules are typically not far apart. On a more technical level, the problem to compute the partition function over RNA secondary structures with given end-to-end distance d, usually measured as the number of external bases (plus possibly the number of structural domains) arises for instance when predicting nucleic acid secondary structure in the presence of single-stranded binding proteins [5] or in models of RNA subjected to pulling forces (e.g. in atom force microscopy or export through a small pore) [68]. It also plays a role for the effect of loop energy parameters [9].

In contrast to the end-to-end distance, the graph-distance between two arbitrarily prescribed nucleotides in a larger RNA structure does not seem to have been studied in any detail. However, this is of particular interest in the analysis of single-molecule fluorescence resonance energy transfer (smFRET) experiments [10]. This technique allows to monitor the distance between two dye-labeled nucleotides and can reveal details of the kinetics of RNA folding in real time. It measures the non-radiative energy transfer between the dye-labeled donor and acceptor positions. The efficiency of this energy transfer, E fret , strongly depends on the spatial distance R according to E fret = R 0 6 /( R 0 6 + R 6 ). The Förster radius R0 sets the length scale, e.g. R0≈54 Å for the Cy3-Cy5 dye pair. A major obstacle is that, at present, there is no general and efficient way to link smFRET measurements to interpretations in terms of explicit molecular structures. To solve this problem, a natural first step is to compute the distribution of spatial distances for an equilibrium ensemble of 3D structures. Since this is not feasible in practice despite major progress in the field of RNA 3D structure prediction [11], we can only resort to considering the graph-distances on the ensemble of RNA secondary structures instead. From a computer science point of view, furthermore, we show here that the distance distribution can be computed exactly using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graph-distance can be related to the smFRET data such as those reported by [12] and help to explain effects of RNA structures in pre-mRNA splicing and viral subgenomic RNA species.


RNA secondary structures

An RNA secondary structure is a vertex labeled outerplanar graph G(V,x,E), where V = {1,2,…,n} is a finite ordered set (of nucleotide positions) and x : {1,2,…,n} → {A,U,G,C},ix i assigns to each vertex at position i (along the RNA sequence from 5’ to 3’) the corresponding nucleotide x i . We write x = x1x n for the sequence underlying secondary structure and use x[ ij] = x i x j to denote the subsequence from i to j. The edge set E is subdivided into backbone edges of the form {i,i + 1} for 1 ≤ i<n and a set B of base pairs satisfying the following conditions:

  1. (i)

    If {i,j} ∈ B then x i x k ∈ {GC,CG,AU,UA,GU,UG};

  2. (ii)

    If {i,j} ∈ B then |j - i| > 3;

  3. (iii)

    If {i,j},{i,k} ∈ B then j = k;

  4. (iv)

    If {i,j},{k,l} ∈ B and i < k < j then i < l < j.

The first condition allows base pairs only for Watson-Crick and GU base pairs. The second condition implements the minimal steric requirement for an RNA to bend back on itself. The third condition enforces that B forms a matching in the secondary structure. The last condition (nesting condition) forbids crossing base pairs, i.e. pseudoknots.

The nesting condition results in a natural partial order in the set of base pairs B defined as {i,j} ≺ {k,l} if k < i < j < l. In particular, given an arbitrary vertex k, the set B k = {{i,j} ∈ B|ikj} of base pairs enclosing k is totally ordered. Note that k is explicitly allowed to be incident to its enclosing base pairs. A vertex k is external if B k = . A base pair {k,l} is external if B k = B l = {{k,l}}.

Consider a fixed secondary structure G, for a given base pair {i,j} ∈ B, we say a vertex k is accessible from {i,j} if i < k < j and there is no other pair {i,j} ∈ B such that i < i < k < j < j. The unique subgraph i , j induced by i, j, and all the vertices accessible from {i,j} is known as the loop of {i,j}. The type of a loop i , j is unique determined depending on whether {i,j} is external or not, and the numbers of unpaired vertices and base pairs. For details, see [13]. Each secondary structure G has a unique set of loops { i , j |{i,j}B}, which is called the loop decomposition of G. The free energy f(G) of a given secondary structure, according to the standard energy model [14], is defined as the sum of the energies of all loops in its unique loop decomposition.

The relative location of two vertices v and w in G is determined by the base pairs B v and B w that enclose them. If B v B w , there is a unique ≺-minimal base pair {iv,w,jv,w} that encloses both vertices and thus a uniquely defined loop { i v , w , j v , w } in the loop associated with v and w. If B v B w = or B w B v = then v or w is unpaired and part of { i v , w , j v , w } . Otherwise, i.e. B v B w = , there are uniquely defined ≺-maximal base pairs {k v ,l v } ∈ B v B w and {k w ,l w } ∈ B w B v that enclose v and w, respectively. We note that B v B w (B w B v ) may be empty, in which case {k v ,l v } ({k w ,l w }) is also empty. This simple partition holds the key to computing distance distinguished partition functions below.

In the following, we assign the weights a for backbone edges and b for base pairs, respectively. Given a path p, we define the weight of the path d(p) as the sum of the weights of edges in the path. The (weighted) graph-distance d v , w G in G is defined as the weight of the path p connecting v and w with d(p) being minimal. For the weights, we require the following condition:

  • (W) If i and j are connected by an edge, then {i,j} ∈ E is the unique shortest path between i and j.

This condition ensures that single edges cannot be replaced by detours of shorter weight. Condition (W) and property (ii) of the secondary structure graphs implies b < 3a because the closing base pair must be shorter than a hairpin loop. Furthermore, considering a stacked pair we need b < b + 2 a, i.e. a > 0. We allow the degenerate case b = 0 that neglects the traversals of base pairs.

Before we continue with the calculations of the partition function, let us first consider the problem formulation in more detail. For the FRET application, it is well-known that FRET efficiency is correlated with spatial distance. Furthermore, only a limited range of distance changes (e.g. 20 Å-100 Å for Cy3-Cy5) can be reported by the FRET experiments. Thus a more useful formulation of our problem is not to use the full expected quantity for all positions. Instead, we are interested in the average for all distance-values within some threshold θ d . As the space and time complexity will depend on the number of distances we consider, we will parametrise our complexity by the number of nucleotides n and the number of distances considered D = θ d + 1, as well. In the worst case, there is D = O (n). However, given that in practice only a limited range of distance changes are considered, we rather view D = O(1) as a small constant in our contribution.

Boltzmann distribution of graph-distances

For a fixed structure G, d v , w G is easy to compute. Here, we are interested in the distribution Pr[ d v , w G |x] and its expected value d v , w =E[ d v , w G |x] over the ensemble of all possible structures G for a given sequence x. Both quantities can be calculated from the Boltzmann distribution P r[ G|x] = e- f (G)/RT/Q where Q= G e - f ( G ) / RT denotes the partition function of the ensemble of structures. As first shown in [15], Q and related quantities can be computed in quartic time. A reduction to a cubic algorithm may be obtained if the free energy of long interior loops may be regarded as prohibitive. This restriction has been widely used for long sequences [16]. Cubic runtime can also be achieved for some but not all parametrizations of interior loop energies [17].

A crucial quantity for our task is the restricted partition function

Z v , w [ d ] = G with d v , w G = d e - f ( G ) / RT

for a given pair v,w of positions in a given RNA sequence x. A simple computation (Appendix A in Additional file 1) verifies that the Pr[ d v , w G =d|x]= Z v , w [d]/Q and d v , w =E[ d v , w G |x]= d ( Z v , w [d]/Q)d. Hence it suffices to compute Zv,w[d] for any 1 ≤ dn. In the following sections we show that this can be achieved by a variant of McCaskill’s approach [15].

For the ease of presentation we describe in the following only the recursion for the simplified energy model for the “circular maximum matching”, in which energy contributions are associated with individual base pairs rather than loops. Our approach can be easily extended to the full model by using separating the partition functions into distinct cases for the loop types.

We use the letters Z and Y to denote partition functions with distance constraints, while Q is used for quantities that appear in McCaskill’s algorithm and are considered as pre-computed here. For instance, let Q i , j B denote the partition function over all secondary structures on x[ i..j] that are enclosed by the base pair {i,j}. We will later also need the partition function Qi,j over the sub-sequence x [ i..j], regardless of whether {i,j} is paired or not. In Additional file 1: Appendix C, we summarize the notations frequently used in our contribution.

Recursions of Zv,w[ d]: The case when v and ware external

An important special case assumes that both v and w are external. This is the case e.g. when v and w are binded by proteins. In particular, the problem of computing end-to-end distances, i.e., v = 1 and w = n, is of this type.

Assuming (W), the shortest path between two external vertices v,w consists of the external vertices and their backbone connections together with the external base pairs. We call this path the inside path of i,j since it does not involve any vertices “outside” the subsequence x [ i..j].

For efficiently calculating the internal distance between any two vertices v,w, we denote by Z i , j I [d] the partition function over all secondary structures on x [ i..j] with distance exactly d.

Now note that any structure on x[ i..j] starts either with an unpaired base or with a base pair connecting i to some position k satisfying i < kj. In the first case, we have d i , j G = d i , i + 1 G + d i + 1 , j G where d i , i + 1 G =a. In the second case, there exists d i , j G = d i , k G + d k , k + 1 G + d k + 1 , j G with d i , k G =b and d k , k + 1 G =a. Thus, Z i , j I [d] can be split as follows,

This gives the recursion

Z i , j I [d]= Z i + 1 , j I [d-a]+ i < k j Q i , k B Z k + 1 , j I [d-b-a]

with the initialization Z ii I [0]=1 and Z ii I [d]=0 for d > 0. For consecutive vertices, we have Z i , i + 1 I [a]=1 and Z i , i + 1 I [d]=0 for da. These recursions have been derived in several different contexts, e.g. force induced RNA denaturations [6], the investigate of loop entropy dependence [9], the analysis of FRET signals in the presence of single-stranded binding proteins [5], as well as in mathematical studies of RNA panhandle-like structures [3, 4].

In the following, it will be convenient to define also a special term for the empty structure. Setting Z i , i - 1 I [-a]=1 and Z i , i - 1 I [d]=0 for d ≠ -a allows us to formally write an individual backbone edge as two edges flanking the empty structure and hence to avoid the explicit treatment of special cases. This definition of ZI also includes the case that i and j are base paired in the recursion (1). This is covered by the case k=j, where we evaluate Z j + 1 , j I [d-b-a]. Since d = b is the only admissible value here, this refers to Z j + 1 , j I [-a], which has the correct value of 1 due to our definition. Later on, we will also need ZI under the additional condition that the path starts and ends with a backbone edge. We therefore introduce Z I defined as by

Z i , j I [d]= Z i + 1 , j - 1 I [d-2a]

Note that if Z i , j I [d] is called with j = i + 1, then we call Z i + 1 , i I [d-2a]. The only admissible value again is the correct value d=a. In sum, we have the following

This recursion requires O(n3D) time and O(n2D) space. It is possible to reduce the complexity of computing the expected distance in this special case by a linear factor. The trick is to use conditional probabilities for arcs starting at i or the conditional probability for i to be single-stranded, which can be determined from the partition function for RNA folding [3], see Additional file 1: Appendix B.

Recursions of Zv,w[ d]: the general case

The distance between two positions v and w that are covered by an arc can be realized by both inside paths and outside paths. Here, “outside” emphasizes that the shortest path between two positions v and w contains vertex does not belongs to x[ v,w]. This case complicates the algorithmic approach, since both types of paths must be controlled simultaneously. Consider Figure 1, the shortest path between the green and blue regions includes some vertices outside the interval between these two regions. The basic idea is to generalize Equation (1) to computing the partition function Zv,w[ d]. The main question now becomes how to recurse over decompositions of both the inside and the outside paths.

Figure 1
figure 1

Inside and outside paths. The shortest path (violet arrows) from v (green) to w (blue) is not an inside path: inside emphasizes that, in contrast to the shortest path (cyan arrows) between the red region and w, it is not contained in the interval determined by its end points.

Figure 1 shows that the outside paths are important for the green region, i.e., the region that is covered by an arc. Hence, we have to consider the different cases that the two positions v and w are covered by arcs. The set Ω of all secondary structures on xcan be divided into two disjoint subclasses that have to be treated differently:

  •  : v and w are not enclosed in a common base pair, i.e., B v B w = .

  •  : there is a base pair enclosing both v and w, i.e., B v B w .

Note that this bipartition explicitly depends on v and w. In the following, we will first introduce the recursions that are required in Ω0 structures to compute Zv,w[ d].

Contribution of Ω0 structures to Zv,w[ d]: Z 0 v , w [d]

One example of this case is given in Figure 1 with the red and blue region, where v (vertex in green region) is covered by an arc, and w (vertex in blue region) is external. Denote the ≺-maximal base pair enclosing v by {i,j}. Since at most one of v and w is covered by an arc, we know that j < w. Hence, every path p from v to w, and hence also the shortest paths (not necessarily unique) must run through the right end j of the arc {i,j}. More precisely, there must sub-paths p1 and p2 with d (p) = d(p1) + d (p2) + a such that v p wv p 1 j-(j+1) p 2 w, where i p j denotes that p is a shortest path from i to j and - denotes a single backbone edge. For the shortest path from v to j, it consists either of a shortest path v p i and the arc {i,j}, or it goes directly to j without using the arc {i,j}.

How does this distinction translate to the partition function approach? If we want to calculate the contribution of this case to the partition function Zv,w[ d], we have to split both the sequence x[ i,w] and distance d as follows


where Z j , w I [ d 2 ] is the partition function starting and ending with a single-stranded base as defined in Equation (2), and Z i , j B , v [ d , d r ] is the partition function consisting of all structures of x[ i,j] containing the base pair {i,j} with the property that the shortest path from v to i has length d and the shortest path from v to j has length d r . In addition, d, d r and d2 must satisfy d = d r + d2.

The remaining cases for the contribution of the class Ω0 to Zv,w[ d] are given by all other possible combinations of v and w being single-stranded or being covered by an arc, i.e.,

To simplify, we extend the definition of Z i , j B , v [ d , d r ] by setting Z v , v B , v [0,0]=1 and Z v , v B , v [ d , d r ]=0 for d + d r > 0. This allows us to conveniently model all cases where either v or w are external, i.e., a.), b.), and d.), as special cases of c.).

In case c.), we have to split the distance d into five sub-distances d l , d r , d l , d r , d I , in which d I can be retrieved from the first four distances. Furthermore, we would require four splitting positions for the sequence for all possible combinations of i,j,k,l. A naïve implementation of this idea would result in an algorithm with time complexity O(n6D5) and space complexity O(n2D2).

A careful inspection shows, however, that the split of the distances for the arcs into d and d r is unnecessary. Since we want to know only distance to the left/right end, we can simply introduce two matrices Z i , j B , v , [d] and Z i , j B , v , r [d] that store these values. These matrices can be generated from Z i , j B , v [ d , d r ] as follows:

Z i , j B , v , [ d ] = d r d r + b d Z i , j B , v [ d , d r ] + d d > d Z i , j B , v [ d , d - b ]

Analogously, we compute Z i , j B , v , r [d]. In this way, we split the distance d into three contributions and we require four splitting positions for the sequence for all possible combinations of i,j,k,.

Therefore, the contribution to Zv,w[ d] for structures in Ω0 is given by

Z 0 v , w [ d ] = d 1 , d 2 d 1 + d 2 d i , j , k , l i v j < k w l Q 1 , i - 1 · Z i , j B , v , r [ d 1 ] · Z j , k I [ d - ( d 1 + d 2 ) ] · Z k , l B , w , [ d 2 ] · Q l + 1 , n

Note that for splitting the distance, we reuse the same indices (e.g., the j in Z i , j B , v , r [ d 1 ]· Z j , k I [d-( d 1 + d 2 )], where as for the remaining partition function, we use successive indices (e.g.,the i in Q 1 , i - 1 · Z i , j B , v , r [ d 1 ]). This difference comes from the fact that splitting a sequence into subsequences is done naturally between two successive indices, whereas splitting a distance is naturally done by splitting at an individual position. We have only to guarantee that the substructures which participate in the split do agree on the structural context of the split position. This is guaranteed by requiring that Z I starts and ends with a backbone edge. We note that the incorporation of the full dangling end parameters makes is more tedious to handle the splitting positions.

This results in a complexity of O(n6D3) time and O(n2D) space. However, we do not need to split in i,j,k,l simultaneously. Instead, we could split case (c) at position j and introduce for all vj and kw the auxiliary variables

Z 1 , j B , v , r [ d 1 ] = i v Q 1 , i - 1 · Z i , j B , v , r [ d 1 ] Z k , n B , w , [ d 2 ] = w l Z k , l B , w , [ d 2 ] · Q l + 1 , n Z j , n I B , w , [ d ] = k > j d 2 d Z j , k I [ d - d 2 ] · Z k , n B , w , [ d 2 ] .

Finally, we can replace recursion (3) by

Z 0 v , w [d]= v j d 1 d Z 1 , j B , v , r [ d 1 ]· Z j , n I B , w , [d- d 1 ]

We thus arrive at O(n3D2) time and O(n2D) space complexity for the contribution of Ω0 structures to Zv,w[ d], excluding the complexity of computing Z i , j B , v [ d , d r ].

Contribution of Ω1 structures to Zv,w[ d]

Ω1 contains all cases where v and w are covered by a base pair. In the following, let {p,q} be the ≺-minimal base pair covering v and w. In principle, this case looks similar to the case for Ω0. However, we have to take into considerations the paths between v and w over the base pair {p,q}. Thus, we need to store the partition function for all inside and outside for each ≺-minimal arc {p,q} that covers v and w, which we will call Z p , q v , w [ d O , d I ]. In principle, a similar recursion as defined for Z 0 v , w in equation (3) can be derived, with the additional complication since we have to take care of the additional outside distance due to the arc {p,q}. Thus, we obtain the following splitting:

Again we can avoid the complexity of simultaneously splitting at {i,j}and {k,l} by doing a major split after j. Thus, we get the following picture,

which leads to the following equivalent recursions:

Y p , j B , v , r [d, d r ]= p < i v d O d Z p , i I [ d O ]· Z i , j B , v [d- d O , d r ]
Y k , q B , w , [ d , d ]= w l < q d O ′′ d Z k , l B , w [ d , d - d O ′′ ]· Z l , q I [ d O ′′ ]
Y j , q I B , w , [ d I , d ]= j < k < q d d I Z j , k I [ d I - d ]· Y k , q B , w , [ d , d ]

Overall, we get the following recursion:

Z p , q v , w [ d O , d I ]= v j d r d I d d O Y p , j B , v , r [d, d r ]· Y j , q I B , w , [ d I - d r , d O -d]

We can now define Zv,w[ d] by

Z v , w [ d ] = Z 0 v , w [ d ] + Q ̂ p , q b · { p , q } { v , w } d I d + b Z p , q v , w [ d , d I ] + { p , q } { v , w } d < d O + b Z p , q v , w [ d O , d ]

where Q ̂ p , q b is the external partition function over all structures on the union of the intervals x[ 1..p] ∪ x[ q..n] so that {p,q} is a base pair. Since the base pair probability can be written as Pr({p,q})= Q ̂ p , q b Q p , q b Q , this quantity can be calculated as Q ̂ p , q b = Pr ( { p , q } ) Q Q p , q b . The base pair probability P r ({p,q}), and the partition functions Q and Q p , q b are computed by means of McCaskill’s algorithm.

This part now has a complexity of O(n2D2) space and O(n3D4) time. For practical applications, however, we do not need to consider all possible {p,q}. Instead, there are only few base pairs that are likely to form and that cover v,w, especially for v,w where the internal distance of v,w is large enough such that an outside path has to be considered at all. If we assume a constant number of such long-range base-pairs, then the complexity is reduced by an n2-factor. For the complexity in terms of distance, recall that D is typically small.

Recursions for Z i , j B , v [ d , d r ]

So far, we have used Z i , j B , v [ d , d r ] as a black box. In order to compute these terms, we distinguish the limiting cases a.) v = i, b.) v = j, c.) is external from the generic case d.):

Starting from the limiting cases, we initialize Z v , j B , v [0, d r ] as follows:

Z v , j B , v [ 0 , d r ] = Z v , j I [ d r ] for a d r < b d b Z v , j I [ d ] for d r = b 0 otherwise

and analogously for Z i , v B , v [ d ,0]. Furthermore, Z i , j B , v [0,0]=0 for ivj. These conventions allow us to model all cases as special cases of d). Our key observation here is that the dependency between d and d r can be used to reduce the time complexity. Instead of using the variables d and d r in Z i , j B , v [ d , d r ], we use the pair d ,dadd in Z i , j B , v [ d , d + d add ]. Similarly, we use d , d add instead of d , d r for the inner base pair, which then determines completely the splitting the distances. This results in an recursion for Z i , j B , v [ d , d + d add ] with complexity O( n 4 D 2 c b 2 ) time and O (n2D c b ) space. To be precise, there are three sub-cases as follows.

The values that are chosen to split d and dadd are indicated in green and blue. When the arc {i,j} is colored violet, then there is a shortest path that does not use the distance marked in red but uses the other direction together with the arc {i,j}. If -b<d add <+b, then we know that neither a shortest path v p i nor v p j uses the arc {i,j}. The left distance is thus given by d - d . Using the shortcuts d r = d + dadd and d r = d + d add , then the distance between l and j must be d r - d r =( d + d add )- d + d add . If, on the other hand, dadd = + b, then we know that there is at least one shortest path that can be composed by using a shortest path vi, followed by the arc {i,j}. This of course implies that the shortest path v p j is has exactly the length d + b, or is larger. For a sub-path l+1 p j this implies that the length is greater or equal d= d r - d r = d + b - d + d add . Thus, we just have to add all partition functions Z k , j I [ d ] with d > d. This can be done efficiently by using a precalculated matrix Z i , j I [d], which is defined as d d Z i , j I [ d ]. Note that Z i , j I [d] can also be defined if we restrict in all recursion the distance d to a threshold θ d , since Z i , j I [d]= d d Z i , j I [ d ]= Q i , j - d < d Z i , j I [ d ]. In which, where Q i , j is Qi + 1,j - 1 if j > i + 1, 1 if j = i + 1 and 0 otherwise. Note, furthermore, that all Z i , j I [ d ] for d< dθ d are calculated when we restrict the distance to θ d .

Finally, if dad d= - b, then the shortest path l p j has distance ( d -b)- d + d add . For the shortest path k p i, we know that it has length d - d or greater, which can be resolved by again using Z i , k - 1 I [ d - d ]. Thus, we get the following optimized recursion for Z i , j B , v [ d , d + d add ] with d ≠ 0 and d + dadd ≠ 0:

Z i , j B , v d l , d l + d add = k l i < k v v l < j d l d l d add - b d add b Z i , k I d l - d l · Z k , l B , v d l , d l + d add · Z l , j I d l + d add - d l + d add if - b < d add < b k l i < k v v l < j d l d l d add - b d add b Z i , k I d l - d l · Z k , l B , v d l , d l + d add · Z l , j I d l - b - d l + d add if d add = b k l i < k v v l < j d l d l d add - b d add b Z i , k I d l - d l · Z k , l B , v d l , d l + d add · Z l , j I d l - b - d l + d add if d add = - b

Discussion and applications

The theoretical analysis of the distance distribution problem shows that, while polynomial-time algorithms exist, they probably cannot be improved to space and time complexities that make them widely applicable to large RNA molecules. Due to the unfavorable time complexity of the current algorithm and the associated exact implementation in C, a rather simple and efficient sampling algorithm has been implemented. We resort to sampling Boltzmann-weighted secondary structures with RNAsubopt -p[16], which uses the same stochastic backtracing approach as sfold[18]. As the graph-distance for a pair of nucleotides in a given secondary structure can be computed in O(n logn) time by Dijkstra’s algorithm with Fibonacci heap [19], even large samples can be evaluated efficiently.

As we pointed out in the introduction, the graph-distance measure introduced in this paper can serve as a first step towards a structural interpretation of smFRET data. As an example, we consider the graph distance distribution of a Diels-Alderase (DAse) ribozyme (Figure 2A). Histograms of smFRET efficiency (E fret ) for this 49 nt long catalytic RNA are reported in [12] for a large number of surface-immobilized ribozyme molecules as a function of the Mg 2+ concentration in the buffer solution. A sketch of their histograms is displayed in Figure 2B. The dyes are attached to sequence positions 6 (Cy3) and 42 (Cy5) and hence do not simply reflect the end-to-end distance, Figure 2A(c). In this example, we observe the expected correspondence small graph-distances with a strong smFRET signal. This is a particular interesting example, since the minimal free energy (mfe) structure (Figure 2A(a)) predicted with RNAfold is not identified with the real secondary structure (Figure 2A(c)). In fact, the ground state secondary structure is ranked as the 3rd best sub-optimal structure derived via RNAsubopt -e. The free energy difference between these two structures is only 0.1 kcal/mol. However, their graph-distances show a relatively larger difference. The 2nd best sub-optimal structure (Figure 2A(b)) looks rather similar with the 3rd structure, in particular, they share the same graph-distance value.

Figure 2
figure 2

Relation between graph-distance distribution and smFRET data. (A) The graph-distance distribution of a Diels-Alderase (DAse) ribozyme at temperature 37°C. Structures (a), (b) and (c) are the top three secondary structures considering their free energy: the minimum free energy structure is shown in (a), (c) is the experimentally determined secondary structure, which is ranked as the 3rd best sub-optimal structure with RNAsubopt -e. The graphic representations of these structures are produced with VARNA[20]. (B) The corresponding smFRET efficiency (E fret ) histograms are reported in [12]. From these data, three separate states of the DAse ribozyme can be distinguished, the unfolded (U), intermediate (I) and folded (F) states. (C) The graph-distance distribution in the ensemble which is approximated with RNAsubopt -p at temperature 50°C.

The smFRET data of [12] indicates the presence of three sub-populations, corresponding to three different structural states: folded molecules (state F), intermediate conformation (state I) and unfolded molecules (state U). In the absence of Mg 2+, the I state dominates, and only small fractions are found in states U and F. Unfortunately, the salt dependence of RNA folding is complex [21, 22] and currently is not properly modeled in the available folding programs. We can, however, make use of the qualitative correspondence of low salt concentrations with high temperature. In Figure 2C we therefore re-compute the graph-distance distribution in the ensemble at an elevated temperature of 50°C. Here, the real structure becomes the second best structure with free energy -10.82 kcal/mol and we observe a much larger fraction of (nearly) unfolded structures with longer distances between the two beacon positions. Qualitatively, this matches the smFRET data showed in Figure 2B.

Furthermore, for a given pair v,w of positions in a given RNA sequence x, the importance Iv,w(e) of a backbone edge or base pair e in calculating the graph-distance distribution is evaluated by I v , w (e)= e Π e Pr[G|x], where the set Π e comprises the secondary structures G with (at least) one shortest path between v and w that runs through e. Figure 3 compares dot plots of Iv,w(e) with the base-pair probabilities in the RNA structure ensemble of the DAse ribozyme at temperatures 37°C and 50°C. Since RNAgraphdist computes only one of possible many shortest paths for each G, hence we obtain only a lower bound on Iv,w(e).

Figure 3
figure 3

Comparison between the base-pair probabilities and the distance importance I 6,42 ( e ). The base-pair probabilities (upper-right-triangle) and the distance importances I6,42(e) (lower-left-triangle) of backbone edges and base pairs between 6(U) and 42(U) of DAse ribozyme (Figure 2) are computed at temperatures 37°C and 50°C, repectively. The size of the squares is proportional to the probability/value. The region covered by the between 6(U) and 42(U) is annotated by a red rectangle. For ease of comparison, backbone edges are added to the base-pair probability matrix.

We observe for DAse that the contributions from the backbone edges are larger than the base pairs at both temperatures. For T = 37°C, there are in total 14 edges with I6,42(e) > 0.4. Only two of them, 5(C)–18(G) and 2(G) – 21(C) are base pairs. For T = 50°C, there is only the pair 5(C) – 18(G) is heavily used (I6,42(5,18) = 0.636). Combining the analysis of data illustrated in Figure 2, it may indicate that the existences of two base pairs, 2(G) – 21(C) and 28(G) – 39(C) can affect the graph-distance distribution of RNA secondary structure ensemble and consequently affect smFRET measurements. Such constraints may become an interesting source of constraints for RNA structure prediction.

In addition, we compute the distribution of paths which pass through positions outside sequence interval x[ 6-h,42+h] of DAse ribozyme. As illustrated in Figure 4, this “outside-path” distribution, as expected, drops fast to 0 with respect to h.

Figure 4
figure 4

“Outside-path” distribution of DAse ribozyme. The distribution of paths which pass through positions outside the sequence interval x[6-h,42+h] of DAse ribozyme (Figure 2). As expected, this probability drops fast to 0 with respect to h.

Long-range interactions play an important role in pre-mRNA splicing and in the regulation of alternative splicing [2325], bringing splice donor, acceptor, branching site into close spatial proximity. Figure 5A shows for D. melanogaster pre-mRNAs that the distribution of graph-distances between donor and acceptor sites shifted towards smaller values compared to randomly selected pairs of positions with the same distance. Due to the insufficiency of the spacial-distance information of structural elements in the secondary structures, we artificially choose a = b = 1 in our experiments. Although the effect is small, it shows a clear difference between the real RNA sequences and artificial sequences that were randomized by di-nucleotide shuffling. Furthermore, Table 1 displays for a specific intron CG16979-RA_intron_0_0_chr3L_15569803 from Drosophila melanogaster (dm3), the most probable secondary structures in the sub-ensembles of secondary structures such that their graph-distances are 7, 6, and 14, respectively.

Figure 5
figure 5

Graph-distance distribution of the Drosophila melanogaster and the genomic RNA of human Coronavirus 229E. (A): Distribution of graph-distances (a = b = 1) in Drosophila melanogaster pre-mRNAs between the first and last intron position. To save computational resources, pre-mRNAs were truncated to 100 nt flanking sequence of introns. The black curve shows the graph-distance distribution computed for the corresponding pairs of positions on sequences that were randomized by di-nucleotide shuffling. (B): Graph-distances (a = b = 1) within and between the 5’ and 3’ regions of the genomic RNA of human Coronavirus 229E computed from a concatenation of position 1–576 (5’ UTR) and 25188–25688 (upstream of gene N). Secondary structures bring the 5’ TRS-L (63–76) and 3’ TRS-B (-23– -10) elements into close proximity.

Table 1 Graph-distance of intron CG16979-RA_intron_0_0_chr3L_15569803 from Drosophila melanogaster (dm3)

The Drosophila melanogaster Down syndrome cell adhesion molecules (DSCAM) encodes for 38.016 different mRNAs by alternative splicing. Among the 24 exons, exon 4 alone has 12 variants [26]. In Figure 6 we display the graph-distance from donor (exon 3) to any downstream position until acceptor (exon 5). Comparing the graph-distances of all twelve acceptors of exon 4, we see clearly local peaks. This suggests the acceptor being part of hairpin loops, three dimensionally poking out of the long transcript to interact easily with the spliceosome and donor. Four of the twelve acceptor sites show no local peak, however seem to be accessible as internal loops of longer hairpins.

Figure 6
figure 6

Graph-distance distribution of DSCAM. Graph-distance distribution of DSCAM from last nucleotide of exon 3 (Chr.2, Pos. 3255892) to any position until exon 5 (Chr.2, Pos. 3249372), including all 12 variations of alternative exon 4. For secondary structure prediction 100 nt flanking region were used.

The spatial organization of the genomic and sub-genomic RNAs is important for the processing and functioning of many RNA viruses. This goes far beyond the well-known panhandle structures. In Coronavirus the interactions of the 5’ TRS-L cis-acting element with body TRS elements has been proposed as an important determinant for the correct assembly of the Coronavirus genes in the host [27]. The mechanisms of interaction is unknown, and a small three-dimensional distance is suspected. The matrix of expected graph-distances in Figure 5B shows that TRS-L and TRS-B are indeed placed close to each other. In Table 2, we show the most stable structures within the sub-ensembles of secondary structures such that their graph-distances are 14, 5, and 35, respectively. All these RNA secondary structures brings the leader transcription regulation site (L-TRS) in close spatial proximity with the body transcription regulation site (B-TRS).

Table 2 Graph-distance of the genomic RNA of human Coronavirus 229E computed from a concatenation of position 1-576 and 25188-25688

These examples indicate that the systematic analysis of the graph-distance distribution both for individual RNAs and their aggregation over ensembles of structures can provide useful insights into structural influences on RNA function. These may not be obvious directly from the structures due to the inherent difficulties of predicting long-range base pairs with sufficient accuracy and the many issues inherent in comparing RNA structures of very disparate lengths.

Due the complexity of algorithm we have refrained from attempting a direct implementation in an imperative programming language. Instead, we are aiming at an implementation in Haskell that allows us to make use of the framework of algebraic dynamic programming [28]. The graph-distance measure and the associated algorithm can be extended in principle to of RNA secondary structures with additional tertiary structural elements such as pseudoknots [29] and G-quadruples [30]. RNA-RNA interaction structures [31] also form a promising area for future extensions. We note finally, that the Fourier transition method introduced in [32] could be employed to achieve a further speedup.


The distribution of spatial distances in the equilibrium structure ensemble of an RNA molecule carries information about the overall structure of the molecule. These distance can be approximated by the graph-distance in RNA secondary structure. We introduced a polynomial time algorithm to compute the equilibrium distribution of graph-distances between a fixed pair of nucleotides. For practical applications, small distances are of main interest. Here, the time complexity of the proposed algorithm is O(n4), compared to a naïve implementation with time complexity of O(n11) for sequence length n and distances that can cover the whole sequence length. Since further reductions, however, seem to be difficult, we also introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several real-life applications, in particular since these primarily concern long-range interactions in very large RNA molecules.