Background

Computational methods for RNA secondary structure prediction have played an important role in unveiling the various regulatory functions of RNA. In the past four decades, these approaches have evolved from predicting a single minimum free energy (MFE) structure [1, 2] to Boltzmann sampling an ensemble of possible structures [3, 4]. Despite its success in a wide range of small RNAs, these thermodynamics-based predictions are by no means perfect.

In parallel, experiments by means of chemical and enzymatic probing have become a frequently used technology to elucidate RNA structure [5,6,7]. The basic idea of these probing methods is to use chemical probes that react differently with paired or unpaired nucleotides. The binding sites can later be detected by biochemical techniques, such as selective \(2'\)-hydroxyl acylation with primer extension (SHAPE) [7, 8], which yield reactivities at nucleotide resolution. To some extent, these reactivities provide information concerning single-stranded or double-stranded RNA regions. However, the reactivity does not unambiguously determine a specific position to be unpaired or paired [9]. While high SHAPE reactivity matches well with unpaired nucleotides, medium reactivity could correspond either to paired or unpaired nucleotides depending on various factors, such as the RNA structure itself or the experimental conditions. Recent advances focus on the development of thermodynamics-based computational tools that incorporate such experimental data as soft constraints to handle the ambiguity [7, 10, 11].

While the use of probing data has significantly improved the prediction accuracy of in silico structure prediction for several classes of RNAs [12], these methods have not solved the folding problem for large RNA systems, such as long non-coding RNAs (lncRNAs, typically 200–20k bases). The reason is that the footprinting data is one-dimensional, i.e. it does not identify base pairing partners of a given nucleotide. In particular, probing data alone cannot distinguish short-range and long-range base pairings. For long RNAs, the existence of the latter, however, has been shown experimentally [13] as well as theoretically [14, 15]. Thus, even combined with experimental data, there are still numerous RNA folds consistent with the probing data.

Based on chemical probing, Sanbonmatsu et al. [16] developed a fragmentation method for determining the secondary structure of lncRNAs in the wet lab. Their approach applies chemical probing of the entire RNA, followed by parallel probing of certain overlapping fragments. Regions of each fragment exhibiting similar probing profiles are folded independently, and combined in order to obtain the entire structure. Although the method has been successfully applied to identify the structures of several lncRNAs [16, 17], their choice of fragments is empirical, which hinders its application to longer RNA sequences [17].

In the following, as in [16], we shall stipulate \((*)\): in all probing experiments there exists a unique distinguished structure, the target. We furthermore assume that the collection of all possible structures is in thermodynamic equilibrium, i.e. a Boltzmann ensemble, and the target is contained in the ensemble. Hence, the problem of structure prediction gives rise to the following challenge:

$$\begin{aligned} {\textit{How to enhance target identification in a Boltzmann ensemble of structures?}} \end{aligned}$$
(1)

In relation to [16], our approach to Problem 1 can be understood as well as outlined as follows: Sanbonmatsu employs in parallel localization of the chemical probing experiment via fragmentation. The latter are somewhat ad hoc and almost certainly “break” any long-range base pairs, see Fig. 1.

Fig. 1
figure 1

The fragmentation of Sanbonmatsu et al. [16] (LHS) and our approach (RHS). Their approach almost certainly “breaks” long-range base pairs (dotted arcs), while ours allows bases from two non-contiguous fragments to pair

The novelty of this paper lies in a different, sequential fragmentation process, assuming \((*)\). Our input consists of the probing data of the entire sequence, giving rise to an augmented Boltzmann ensemble containing by construction the target (which of course is not known). Instead of a parallel fragmentation into subsequences, we “localize” differently, namely we successively ask whether a specific base pair is contained in the target or not. The particular base pair is identified using information theoretic properties of the Boltzmann ensemble. While we do not know the target explicitly, we can decide, with high accuracy, if it contains a particular base pair. Specifically, we cut the subsequence covered by the base pair, and glue the remainder at the cut-points. On the resulted two subsequences, our approach requires probing data to be generated from probing experiments. We then compare the additional probing data with the initial probing profile of the entire sequence. At a fundamental level, our fragmentation is different from Sanbonmatsu’s approach [16], in that we allow bases from two non-contiguous fragments to pair, see Fig. 1. As a consequence, our method is well suited to deal with the long-range base pairings, these being a prominent feature of RNA secondary structures [13, 14].

The answer to each question produces a split of the ensemble into two sub-samples, and we arrive at smaller sub-samples via successively querying and answering. We then establish that, after a few iterations, we arrive at a sample that contains a distinguished structure that, with high probability, coincides with the target. We illustrate the overall strategy in Fig. 2.

Fig. 2
figure 2

The workflow diagram (LHS) and schematic example (RHS) of our approach. LHS: we compute the ensemble tree, i.e., successively split the Boltzmann ensemble of structures into smaller sub-samples, by querying certain base pairs is contained in the target or not. RHS: the answers to the base-pair questions allow us identify a path (red) in the ensemble tree from the root to the leaf, which contains a distinguished structure

We formalize the above sequential process, by considering a variant of the Rényi-Ulam game, in which a player tries to identify an unknown object via asking yes–no questions [18, 19]. Our framework is centered around the ensemble tree, a hierarchical bi-partition of the input ensemble, whose leaves are comprised of sub-samples exhibiting a distinguished structure with high probability. Specifically, the ensemble tree is constructed by recursively querying about whether or not a base pair of maximum information entropy is contained in the target. We prove that the query of maximum entropy base pair splits the ensemble into two even parts and in addition provides maximum reduction in the entropy of the ensemble. These questions can be answered in the affirmative because of assumption \((*)\). They are answered via relating additional “local” probing data with the initial one, employing the modularity in RNA secondary structures. By this means, we identify the correct path in the ensemble tree from the root to the leaf.

The key result of this paper is that the probability of the ensemble tree correctly identifying the target in the leaf is greater than \(90\%\), for the Boltzmann ensembles from random sequences of length 300, in “Target identification” section. To demonstrate the result, we firstly utilize a q-Boltzmann sampler with signature distance filtration, which is well suited for Boltzmann ensembles subjected to the probing data constraint [7, 11], see “The Boltzmann ensemble” section. Secondly, we consider the error rates arisen from answering the queries via probing data. We show that these error rates can be significantly reduced via repeated queries in “Target identification” section. Thirdly, in “Entropy” section, we prove that the leaf with low information entropy contains a distinguished structure. We present that, once in the correct leaf, the probability the distinguished structure being identical to the target is almost always correct. Fourthly, in “Robustness” section, we analyze the robustness of our approach. We demonstrate that the ensemble tree localizing the target with high fidelity is robust, across Boltzmann samples of different sizes and nucleotide compositions.

We would point out that the q-Boltzmann sampler is only required to benchmark our approach on random sequences, due to the absence of probing data. In application scenarios where chemical probing data is provided, our approach utilizes Boltzmann ensembles with soft constraints [7].

As proof of concept, we apply our approach to natural RNAs with SHAPE probing data and compute the distinguished structure from the ensemble tree to predict the accepted secondary structure, i.e., the target. We show in “Performance comparison” section that our approach improves the average prediction accuracy by \(5\%\), compared with [8].

To summarize, the key points of our approach are:

  1. 1.

    our method is based on a Boltzmann sample and derives a sub-sample that contains the target with high probability,

  2. 2.

    the derivation is facilitated by means of the ensemble tree, and the identification of the correct path from root to leaf, is obtained by a variant of the Rényi-Ulam game,

  3. 3.

    the answers to the respective queries are inferred from chemical probing, by relating additional probing data to the initial data using modularity.

This paper is organized as follows: in “Methods” section, we introduce the main elements of our framework: the Rényi-Ulam game, the Boltzmann ensemble, base-pair queries and the ensemble tree. In “Path identification” section, we demonstrate how to integrate additional probing data with the initial ones allowing to answer the queries, thereby identifying the correct path. In “Results” section, we analyze the ensemble tree and present that our approach identifies the target reliably and efficiently. Finally, we discuss and integrate our results in “Discussion” and Conclusion” sections.

Methods

The Rényi-Ulam game

We now approach Problem 1 via the Rényi-Ulam game, a two-person game, played by a questioner (Q) and an oracle, (O). Initially O thinks of an integer, Z, between one and one million and Q’s objective is to identify Z, asking yes-no questions. O is allowed to lie at a rate specific to yes and no, respectively.

The Rényi-Ulam game has been extensively studied since the early works by Rényi and Ulam [18, 19], and has various applications such as adaptive error-correcting codes in the context of noisy communication [20, 21]. Depending on the respective application scenario, numerous variants of the Rényi-Ulam game have been considered, specifying the format of admissible queries or the way O lies [22, 23].

In what follows, we shall play the following version of the game: O holds a set of bit strings\(y_1 y_2 \ldots y_l\) of finite length l, not every bit string being equally likely selected and the queries ask for the state of the ith-bit, i.e., Q executes bit query. O’s lies occur at random, are independent and context-dependent. Specifically, O lies with probability \(e_0\) and \(e_1\) in case of the truthful answer being ”No” and “Yes”, respectively. The particular cases \(e_0=0\) and \(e_1=0\) have been studied in the context of half-lies [24].

The majority of studies on the Rényi-Ulam game to date is combinatorial. That is, they stipulate the number of lies (or half-lies) being a priori known and focus on finding optimal search strategies which uses a minimum number of queries to identify the target in all cases [23, 24].

Within the framework of this paper, we study the manifestation of the oracle, which is embodied as an indicator random variable whose distribution is derived from a modularity analysis on RNA MFE-structures, see “Path identification”. In the manifestation, erroneous responses arise intrinsically at random: either as a result of the distribution of the random variable (r.v.) or intrinsic errors of the experimental data.

By construction, this rules out a unique winning strategy for Q: instead, we consider the average fidelity or accuracy to identify the target utilizing a sub-optimal number of queries. We shall propose an entropy-based strategy: at any point a query is selected relative to the subset of bit strings coinciding with the target in all previously identified positions, that maximizes the uncertainty reduction on the subset.

The Boltzmann ensemble

At a given point in time, an RNA sequence, \({\mathbf {x}}\), assumes a fixed secondary structure, by establishing base pairings. Over time, however, \({\mathbf {x}}\) assumes a plethora of RNA secondary structures appearing at specific rates, see Appendix A for details and context on RNA. These exist in an equilibrium ensemble expressed by the partition function [3] of \({\mathbf {x}}\).

More formally, the structure ensemble, \(\Omega \) of \({\mathbf {x}}\) is a discrete probability space over the set of all secondary structures, equipped with the probability p(s) of \({\mathbf {x}}\) folding into s. We shall assume that the ensemble of structures is in thermodynamic equilibrium, the distribution of these structures being described as a Boltzmann distribution. The Boltzmann probability, p(s), of the structure s is a function of the free energyE(s) of the sequence \({\mathbf {x}}\) folding into s, computed via the Turner energy model [25, 26], see Appendix B for details. The Boltzmann probability p(s) is expressed as the Boltzmann factor \(\exp {(-E(s)/R T)}\), normalized by the partition function, \(Z=\sum _{s\in \Omega } \exp {(-E(s)/R T)}\), i.e.

$$\begin{aligned} p(s) = \frac{\exp {(-E(s)/R T)}}{Z}, \end{aligned}$$

where R denotes the universal gas constant and T is the absolute temperature. The Boltzmann distribution facilitates the computation of the partition function Z for each substructure. The partition function algorithm [3] for secondary structures computes Z and, in particular, the base pairing probabilities based on the free energies for each structure within the structure ensemble \(\Omega \).

Let \(p_{i,j}\) denote the probability of a base pairing between nucleotides i and j in the ensemble \(\Omega \). Clearly, \(p_{i,j}\) can be computed as the sum of probabilities of all secondary structures that contain (ij), that is,

$$\begin{aligned} p_{i,j} = \sum _{s\in \Omega } p(s) \delta _{i,j}(s), \end{aligned}$$

where \(\delta _{i,j}(s)\) denotes the occurrence of the base pair (ij) in s.

The thermodynamics-based partition function has been extended to incorporate chemical probing data to generate a Boltzmann ensemble, \(\Omega _{\text {probe}}\). These approaches [7, 10, 11] transform structure probing data into a pseudo energy term, \(\Delta G(s) \), which reflects how well the structure agrees with the probing data. The Turner free energy is then evaluated by adding the pseudo energy term to the loop-based energy, i.e., \(E_{\text {probe}}(s)= E(s)+ \Delta G(s) \). The corresponding equilibrium ensemble, \(\Omega _{\text {probe}}\), is distorted in favor of structures that are consistent with probing data, see Appendix C.

For sequences whose probing data are not available, we utilize the 0-1 signature of the target, which is suited for probing data, and quantify the discrepancy between the Boltzmann ensemble and the target via the signature distance \(d_{sn}\). The 0-1 signature of a structure s is a 0-1 vector with k-th tuple being 0 when the k-th base is unpaired in s, and 1 otherwise. The signature distance\(d_{\text {sn}}(s,s')\) between two structures s and \(s'\) is the Hamming distance between their corresponding 0-1 signatures, see Appendix A. We present that the average distance for an unrestricted ensemble \(\Omega \) to a random target is 0.21n, while the distance for an ensemble \(\Omega _{\text {probe}}\) incorporating simulated probing data is reduced to 0.03n, see Appendix D. This motivates us to define a q-Boltzmann ensemble, \(\Omega ^q\), which consists of structures having signature distance to the target s at most qn, i.e., \(\Omega ^q=\{ s'| d_{\text {sn}}(s',s) \le q n \}\). By construction, the computation of \(\Omega ^q\) requires the 0-1 signature of the target, and does not need experimental probing data. In particular, we present that the ensemble \(\Omega _{\text {probe}}\) has an average normalized signature distance similar to a q-ensemble having \(q=0.05\). In this paper we discuss unrestricted and restricted Boltzmann ensembles, \(\Omega \) and \(\Omega ^q\).

We shall employ greyscale diagrams in order to visualize a sample of secondary structures by superimposing them in one diagram, visualizing the base pairing probabilities. A greyscale diagram displays each base pair (ij) as an arc with greyscale \(1-p_{i,j}\), where greyscale 0 represents black and 1 represents white, see Fig. 3.

Fig. 3
figure 3

The greyscale diagram of of 1024 Boltzmann sampled structures of a random RNA sequence via ViennaRNA [12]

Instead of computing the entire ensemble, we shall consider sub-samples \(\Omega '\) consisting of N secondary structures with multiplicities of \({\mathbf {x}}\) and refer to \(\Omega '\) as the sample. For sufficiently large N (typically of around size 1000, see [4]), \(\Omega '\) provides a good approximation of the Boltzmann ensemble \(\Omega \).

A sample \(\Omega '\) is a multiset of cardinality N and for each structure s in \(\Omega '\), its multiplicity, f(s), counts the frequency of s appearing in \(\Omega '\). Thus in the context of \(\Omega '\), p(s) is given by the s-multiplicity divided by N, \(p(s)=f(s)/N\). The base pairing probability \(p_{i,j}\) has its \(\Omega '\)-analogue f(ij)/N, where f(ij) denotes the frequency of the base pair (ij) appearing in \(\Omega '\). We shall develop our framework in the context of the structure ensemble \(\Omega \), and only reference the sample \(\Omega '\), in case the results are particular to \(\Omega '\).

The bit queries

Any structure over n nucleotides is considered as a bit string of dimension \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) \), stipulating (1) a structure is completely determined by the set of base pairs it contains and (2) any position can pair with any other position, except of itself.

The bit query now determines a single bit, i.e. whether or not the base pair (ij) is present in the target, stipulating that a unique target is assumed by the sequence in question. We associate the query about the target with a random variable, \(X_{i,j}\), defined on the ensemble, via questioning the presence of (ij) in each structure. By construction, the distribution of \(X_{i,j}\) is given by the base pairing probability \({\mathbb {P}}(X_{i,j}(s)=1)=p_{i,j}\).

Any base pair, (ij), has an entropy, defined by the information entropy of \(X_{i,j}\), i.e.

$$\begin{aligned} H(X_{i,j})=-p_{i,j} \log _2 p_{i,j} -(1-p_{i,j} ) \log _2 (1-p_{i,j} ), \end{aligned}$$

where the units of H are in bits. The entropy \(H(X_{i,j}) \) measures the uncertainty of the base pair (ij) in \(\Omega \) . When a base pair (ij) is certain to either exist or not, its entropy \(H(X_{i,j})\) is 0. However, in case \(p_{i,j} \) is closer to 1/2, \(H(X_{i,j})\) becomes larger.

The r.v. \(X_{i,j}\) partitions the space \(\Omega \) into two disjoint sub-spaces \(\Omega _0\) and \(\Omega _1\), where \( \Omega _k=\{s\in \Omega :X_{i,j}(s)=k\}\) (\(k=0,1\)), and the induced distributions are given by

$$\begin{aligned} p_0(s)=\frac{p(s)}{1-p_{i,j}} \quad \text { for } s\in \Omega _0,\qquad p_1(s)=\frac{p(s)}{p_{i,j}} \quad \text { for } s\in \Omega _1. \end{aligned}$$

Intuitively, \(H(X_{i,j}) \) quantifies the average bits of information we would expect to gain about the ensemble when querying a base pair (ij). This motivates us to consider the maximum entropy base pairs, the base pair \((i_0,j_0)\) having maximum entropy among all base pairs in \(\Omega \), i.e.

$$\begin{aligned} (i_0,j_0)= \mathop {\mathrm{argmax}}\limits _{(i,j)} H(X_{i,j}). \end{aligned}$$

As we shall prove in “Entropy” section, \(X_{i_0,j_0}\) produces maximally balanced splits.

The ensemble tree

Equipped with the notion of ensemble and bit query (i.e. the respective maximum entropy base pairs), we proceed by describing our strategy to identify the target structure as specified in Problem 1. The first step consists in having a closer look at the space of ensemble reductions.

Each split obtained by partitioning the ensemble \(\Omega \) using r.v. \(X_{i,j}\), can in turn be bipartitioned itself via any of its maximum entropy base pairs. This recursive splitting induces the ensemble tree, \(T(\Omega )\), whose vertices are sub-samples and in which its k-th layer represents a partition of the original ensemble into \(2^k\) blocks. \(T(\Omega )\), is a rooted binary tree, in which each branch represents a \(X_{i,j}\)-induced split of the parent into its two children.

Formally the process halts if either the resulting sub-spaces are all homogeneous, i.e. their structural entropy is 0, which means that they contain only copies of one structure, or it reaches a predefined maximum level L. In our case we set the maximum level to be \(L=\log _2 N+1=11\), that is, the height of the ensemble tree is at most 10. The procedure is described as follows:

  1. 1.

    start with the ensemble \(\Omega \).

  2. 2.

    for each space \(\Omega _{{\mathbf {k}}}\) with \(H(\Omega _{{\mathbf {k}}})>0\), where \({\mathbf {k}}\) is a sequence of 0s and 1s having length at most \(L-1=10\), compute:

    • select the maximum entropy base pair \(X_{i_{{\mathbf {k}}},j_{{\mathbf {k}}}}\) of \(\Omega _{{\mathbf {k}}}\) as the feature, i.e.

      $$\begin{aligned} X_{i_{{\mathbf {k}}},j_{{\mathbf {k}}}}= \mathop {\mathrm{argmax}}\limits _{(i,j) \text { in } \Omega _{{\mathbf {k}}}} H(X_{i,j}). \end{aligned}$$
    • split \(\Omega _{{\mathbf {k}}}\) into sub-spaces \(\Omega _{{\mathbf {k}}0}\) and \(\Omega _{{\mathbf {k}}1}\) using the feature \(X_{i_{{\mathbf {k}}},j_{{\mathbf {k}}}}\), that is, \( \Omega _{{\mathbf {k}}l }=\{s\in \Omega _{{\mathbf {k}}} :X_{i_{{\mathbf {k}}},j_{{\mathbf {k}}}}(s)=l\}\) for \(l=0,1\),

  3. 3.

    repeat Step 2 until all new sub-spaces either have structural entropy 0 or reach the maximum level 11.

figure a

In Fig. 4 we display an ensemble tree. We would remark that the ensemble tree may not be complete. The reason is that, when a sub-sample has entropy 0 and thus consists of only one structure, the splitting of this sub-sample will stop.

Clearly, for each space \(\Omega _{{\mathbf {k}}}\), the entropies of base pairs can be computed via the traversal of each bit in each structure, and the number of bit queries grows quadratically in the sequence length n. Thus finding the maximum entropy base pair can be implemented in quadratic time \(O(|\Omega _{{\mathbf {k}}}|\cdot n^2)\), with respect to the sequence length n. Since the sum of sizes of spaces on level i equals to N, each level of the ensemble tree requires \(O(N\cdot n^2)\) computations. Therefore the time complexity of Algorithm 1 is \(O(L\cdot N\cdot n^2)\), i.e., quadratic with respect to the sequence length n, assuming the maximum level L and the ensemble size N are constants. Via indexing structures in \(\Omega \) and memorizing indices in \(\Omega _{{\mathbf {k}}}\), Algorithm 1 requires O(N) memory on each level, i.e., its space complexity is \(O(L\cdot N)\).

Fig. 4
figure 4

An ensemble tree having maximum level 3. A path from the root to a leaf is identified in color red

Path identification

Given the ensemble tree, we shall construct a path recursively starting from the root to identify the leaf that contains the target. We shall do so by successive bit queries about maximum entropy base pairs, see Fig. 4.

As mentioned before, we employ two manifestations of the oracle, one using modularity based on RNA-folding, and the other determining the existence of base pairs by other experimental means.

The oracle via modularity and RNA folding

Here we shall employ modularity of RNA structures, i.e.  the loops, which constitute the additive building blocks for the free energy have only marginal dependencies. This can intuitively be understood by observing that any two loops can only intersect in at most two nucleotides, see Appendix B.

Let us introduce the notion of embedding and extraction of a contiguous subsequence or fragment, which are functions \(\epsilon _{i,j} :B^{n-m} \times B^{m} \rightarrow B^{n}\) and \(\xi _{i,j} :B^{n} \rightarrow B^{n-m} \times B^{m} \) given by

$$\begin{aligned} \epsilon _{i,j}((x_i,\dots , x_j),(y_1,\dots ,y_m))&= (y_1,\dots ,y_{i-1},x_i,\dots ,x_j,y_{i+1},\dots y_m)\\ \xi _{i,j}(x_1,\dots ,x_{n})&= ((x_i,\dots ,x_{j}),(x_1,\dots ,x_{i-1},x_{j+1},\dots ,x_{n})). \end{aligned}$$

where B denote the set of bases \( \{\mathbf{A,U,C,G } \}\), and \(j-i+1=n-m\). By construction, we have \(\epsilon _{i,j}\circ \xi _{i,j}=\text {id}\) and a contiguous subsequence or fragment of an RNA sequence is called modular if it being extracted folds into the same arc configuration as it does embedded in the sequence.

Next we show how to employ probing data to reliably answer whether or not a particular (maximum entropy) arc is contained in the target structure. Structural modularity implies that if this arc can indeed be found in the target structure, then a comparative analysis of the probing data of the entire sequence with those of the extracted sequence, as well as the remainder, concatenated at the cut points will exhibit distinctive similarity. Modularity is a decisive discriminant, if, in contrast, random fragments do not exhibit such similarity.

To quantify to what extent modularity can discriminate base pairs, we perform computational experiments on random sequences via splittings. For each sequence, we consider its MFE structure s computed via ViennaRNA [12]. We shall utilize the 0-1 signature of the MFE to mimics its probing data. Given two positions i and j, we cut the entire sequence \({\mathbf {x}}\) into two fragments, \({\mathbf {x}}_{i,j}\) and the remainder \(\bar{{\mathbf {x}}}_{i,j}\), i.e., \( \xi _{i,j}({\mathbf {x}})=({\mathbf {x}}_{i,j},\bar{{\mathbf {x}}}_{i,j})\). Subsequently, the two fragments \({\mathbf {x}}_{i,j}\) and \(\bar{{\mathbf {x}}}_{i,j}\) refold into their MFE structures \(s_{i,j}\) and \({\bar{s}}_{i,j}\), respectively, which are combined into a structure \(\epsilon _{i,j}(s_{i,j}, {\bar{s}}_{i,j})\). If bases i and j are paired in s, such a splitting is referred to as modular and the resulting structure is denoted by \(s'\). Otherwise, it is called random, with the output structure \(s''\). We proceed by computing the base-pair and signature distance from the MFE s to the structures \(s'\) or \(s''\). The base-pair distance is one of the most frequently used metrics to quantify the similarity of two different structures viewed as bit strings [27, 28], the signature distance measures the similarity between their signatures, which is well suited within the context of the probing profiles, see Appendices A and D.

In the above computation, we run through all possible positions i and j. For fragments \({\mathbf {x}}_{i,j}\) and \(\bar{{\mathbf {x}}}_{i,j}\), we compute the corresponding MFEs and distances. Accordingly, the time complexity of the computation is \(O(n^6)\), given by \(O(n^2)\) choices of indices times the \(O(n^3)\) complexity of the MFE folding and the linear time of the distance computation, where n is the sequence length.

Figure 5 (LHS) compares the distribution of the signature distances \(d_{\text {sn}}(s,s')\) and \(d_{\text {sn}}(s,s'')\) obtained from modular and random splittings, respectively. The structures induced by modular splitting have much more similar signatures to their MFE structures, than those induced by random splitting. The situation is analogous for base-pair distances, see Fig. 5 (RHS). Since these distances measure structural similarity, the data also indicates that, when i and j form a base pair in s, the fragment \({\mathbf {x}}_{i,j}\) is more likely to fold into the same configuration as it does being embedded, i.e. \({\mathbf {x}}_{i,j}\) is modular.

Fig. 5
figure 5

The distributions of the signature distances (LHS) \(d_{\text {sn}}(s,s'),d_{\text {sn}}(s,s'')\) and the base-pair distances (RHS) \(d_{\text {bp}}(s,s'),d_{\text {bp}}(s,s'')\) obtained from modular splitting (blue) and random splitting (orange). We generated 8000 random sequences \({\mathbf {x}}\) of length 500, and computed their structures s (MFE). For any two positions i and j, we compute \(s'\) (modular), \(s''\) (random) via ViennaRNA [12]. The red dashed line (left) denotes “threshold distance”, 31 (see main text)

The data displayed in Fig. 5 suggests the threshold distance, \(\theta \), for signatures, by which we distinguish modular from random. More specifically, if the signature distance is smaller than \(\theta \), we predict that bases i and j are paired. Otherwise they are unpaired. In order to quantify the accuracy of this classification, we consider the resulting false discovery rate (FDR) and false omission rate (FOR).Footnote 1 In our Rényi-Ulam game variation, the expected values of FDR and FOR are the error rates \(e_1\) and \(e_0\) in case the truthful answer being yes and no, respectively. Figure 6 displays the error rates \(e_0\) and \(e_1\) as functions of \(\theta \). For \(\theta =31\), we compute \(e_0\approx 0.052\) and \(e_1 \approx 0.007\), i.e.  we have an error rate of 0.052 for rejecting and an error rate of 0.007 for confirming a base pair.

Fig. 6
figure 6

The error rates \(e_0\) and \(e_1\) as a function of the threshold \(\theta \). We use the same sequences and structures as described in Fig. 5. When the signature distance is smaller than \(\theta \), we predict that bases i and j are paired, otherwise they are unpaired. The error rates \(e_0\) and \(e_1\) are given by the false omission rate (FOR) and false discovery rate (FDR), respectively

A new fragmentation

Equipped with the ensemble tree and the oracle via modularity, our framework provides a fragmentation process combining “local” probing profiles with the “global” one via modularity. The novel fragmentation process is guided by the base-pair queries of the ensemble tree inferred from the restricted Boltzmann sample incorporating chemical probing. Given the maximum entropy base pair, (ij), extraction splits the sequence into two fragments, one being the extracted fragment \({\mathbf {x}}_{i,j}\) and the other, \(\bar{{\mathbf {x}}}_{i,j}\). We perform probing experiments on these two segments, and obtain the reactive probabilities \({\mathbf {q}}_{i,j}\) and \(\bar{{\mathbf {q}}}_{i,j}\), respectively. Let \({\mathbf {q}}\) be the reactive probability for the entire sequence, and \({\mathbf {q}}'\) be the embedding of \({\mathbf {q}}_{i,j}\) into \(\bar{{\mathbf {q}}}_{i,j}\), i.e., \({\mathbf {q}}'= \epsilon _{i,j} ({\mathbf {q}}_{i,j}, \bar{{\mathbf {q}}}_{i,j})\). As shown in the previous subsection, if the Hamming distance \(d({\mathbf {q}},{\mathbf {q}}') \) is smaller than threshold \(\theta \), then the probing profiles are similar, i.e., two bases i and j are paired. Otherwise, they are unpaired in the target structure.

The fragmentation procedure can be summarized as follows:

  1. 1.

    a probing experiment for the entire sequence is performed and the reactive probability \({\mathbf {q}}\) is obtained,

  2. 2.

    a Boltzmann sample \(\Omega _{\text {probe}}\) of N structures, consistent with the probing data \({\mathbf {q}}\) is computed,

  3. 3.

    the ensemble tree \(T(\Omega )\) containing the sub-spaces \(\Omega _{{\mathbf {k}}}\) and the corresponding maximum entropy base pairs \(X_{i_{{\mathbf {k}}},j_{{\mathbf {k}}}}\) is constructed,

  4. 4

    starting with \(\Omega \) we recursively answer the queries, determining thereby a path through the ensemble tree from the root to a leaf.

  5. 5

    once in a leaf, Proposition 1 guarantees the existence of a distinctive structure which we stipulate to be the target structure.

Figure 7 demonstrates the workflow of the fragmentation process, which can be considered as an implementation of our overall strategy in Fig. 2 (LHS), via incorporating the new fragmentation process into the workflow.

Fig. 7
figure 7

The workflow diagram of our fragmentation process

We would point out that the key of path identification is determination of base pairs. Instead of using modularity and “local” probing data, we can also apply other experimental approaches to identifying base pairs. In Appendix E, we summarize state-of-the-art experimental approaches that could possibly be utilized to determine base pairs and to identify the path in the ensemble tree. In particular, we detail two methods, both of which utilize chemical probing data in different ways than our fragmentation [29, 30] and recover base pairs with a false discovery rate less than 0.05.

Results

Given an input sample \(\Omega \), we construct the ensemble tree \(T(\Omega )\) having maximum level \(L=11\), recursively computing the maximum entropy base pairs as described in Algorithm 1. In this section, we shall analyze the entropy of leaves in order to quantify the existence of a distinguished structure and to identify the target.

Entropy

To quantify the uncertainty of an ensemble, we define the structural entropy of an ensemble, \(\Omega \), of an RNA sequence, \({\mathbf {x}}\), as the Shannon entropy

$$\begin{aligned} H(\Omega ) = -\sum _{s\in \Omega } p(s) \log _2 p(s), \end{aligned}$$

the units of H being bits. The sum is taken over all secondary structures s of \({\mathbf {x}}\), and p(s) denotes the Boltzmann probability of the structure s in the ensemble \(\Omega \). The notion of structural entropy is originated in thermodynamics and is usually regarded as a measure of disorder, or randomness of an ensemble [31, 32].

Given a sample \(\Omega '\) of size N, the structural entropy has the upper bound \( \log _2 N\), that is, \(H_{}(\Omega ')\) reaches its maximum when all sampled structures are different. Throughout the paper, we assume \(N=1024\) and therefore \(H_{}(\Omega ')\le 10\).

Proposition 1

Let\(\Omega '\) be a sample having structural entropyE, where\(0\le E \le 1\). Then there exists one structure in\(\Omega '\)having probability at leastf(E), wheref(E) is the solution of the equation

$$ -p \log _2 p -(1-p) \log _2 (1-p)=E $$

satisfying\(0.5\le p \le 1\). In particular, we have\(f(1)=0.5\), \(f(0.469) \approx 0.9\)and\(f(0.286) \approx 0.95\), see Fig. 8.

Fig. 8
figure 8

A sample with structural entropy E contains a distinguished structure having probability at least f(E)

Proposition 1 implies that a sample with small structural entropy contains a distinguished structure and a proof is given in Appendix F. We refer to a sample having a distinguished structure of probability at least \(\lambda \) as being \(\lambda \)-distinguished.

Next we quantify the reduction of a bit query on an ensemble. Recall that the associated r.v. \(X_{i,j}\) of a base pair (ij) partitions the sample \(\Omega \) into two disjoint sub-samples \(\Omega _0\) and \(\Omega _1\), where \( \Omega _k=\{s\in \Omega :X_{i,j}(s)=k\}\) (\(k=0,1\)).

The conditional entropy, \(H(\Omega |X_{i,j})\), represents the expected value of the entropies of the conditional distributions on \(\Omega \), averaged over the conditioning r.v. \(X_{i,j}\) and can be computed by

$$\begin{aligned} H(\Omega |X_{i,j})= (1-p_{i,j}) H(\Omega _0)+ p_{i,j} H(\Omega _1). \end{aligned}$$

Then the entropy reduction\(R(\Omega ,X_{i,j})\) of \(X_{i,j}\) on \(\Omega \) is the difference between the a priori Shannon entropy \(H(\Omega )\) and the conditional entropy \(H(\Omega |X_{i,j}) \), i.e.

$$\begin{aligned} R(\Omega ,X_{i,j})= H(\Omega )-H(\Omega |X_{i,j}). \end{aligned}$$

The entropy reduction quantifies the average change in information entropy from an ensemble in which we cannot tell whether or not a certain structure contains (ij), to its bipartition where one of its two blocks consists of structures that contain (ij) and the other being its complement.

Proposition 2

The entropy reduction \(R(\Omega ,X_{i,j})\) of \(X_{i,j}\) is given by the entropy \(H(X_{i,j})\) of \(X_{i,j}\) , i.e.

$$\begin{aligned} R(\Omega ,X_{i,j})= H(X_{i,j}). \end{aligned}$$
(2)

Proposition 2 queries a Bernoulli random variable inducing a split, reducing its average conditional entropy exactly by the entropy of the random variable itself. In the context of the Rényi-Ulam game, Q asks a question that helps to maximally reduce the space of possibilities. A proof of Proposition 2 is presented in Appendix G.

The next observation shows that querying maximum entropy base pairs, induces a best possible balanced split of the ensemble.

Proposition 3

Suppose that\(X_{i,j}\)induces a partition of the ensemble\(\Omega \)into sub-samples\(\Omega _0^{i,j}\)and\(\Omega _1^{i,j}\). Let\((i_0,j_0)\)be a maximum entropy base pair of\(\Omega \). Then we have

  1. 1.

    \((i_0,j_0)\)minimizes the difference of the probabilities of the two sub-samples,

    $$\begin{aligned} |{\mathbb {P}}(\Omega _0^{i_0,j_0}) -{\mathbb {P}}(\Omega _1^{i_0,j_0})|\le |{\mathbb {P}}(\Omega _0^{i,j}) -{\mathbb {P}}(\Omega _1^{i,j})|, \end{aligned}$$

    for any (ij). Here we define\({\mathbb {P}}(\Omega _k^{i,j}) = {\mathbb {P}}(s\in \Omega :X_{i,j}(s)=k)\)with\(k=0,1\).

  2. 2.

    \((i_0,j_0)\)maximizes the entropy reduction\(R(\Omega ,X_{i,j})\)of\(X_{i,j}\)on\(\Omega \),

    $$\begin{aligned} R(\Omega ,X_{i_0,j_0}) \ge R(\Omega ,X_{i,j}), \end{aligned}$$

    for any (ij).

Proposition 3 first shows that the bit query about the maximum entropy base pair \(X_{i_0,j_0}\) partitions the ensemble as balanced as possible, i.e. into sub-samples having the minimum difference of their probabilities. It furthermore establishes that the splits have minimum average structural entropy (or uncertainty), since \(X_{i_0,j_0}\) provides the maximum entropy reduction on the ensemble. Thus the query about \((i_0,j_0)\) is the most informative among all bit queries.

Finally we quantify the average entropy of sub-samples, \(\Omega _{t}\), on the t-th level of the ensemble tree, and establish the existence of a distinguished structure. The analysis of entropies depends of course on the way the samples are being constructed. To this end, given a random sequence, we construct the ensemble tree for two types of samples, one being unrestricted samples of structures, \(\Omega \), and the other utilizing q-Boltzmann sampling that incorporates the signature of the target, \(\Omega ^q\), see “The Boltzmann ensemble” section. Specifically, the target structure is randomly selected from the unrestricted sample, and the q-Boltzmann sample utilizes the 0-1 signature of the target. We would point out that our framework does not require to ”choose” the target for a sequence, and here we make the choices to facilitate the computation on random sequences.

For unrestricted Boltzmann samples, the structural entropy \(H(\Omega _{t})\) of sub-samples on the t-th level decreases, as the level t increases, see Fig. 9. In particular, the average entropy \(H(\Omega _{11})\) of leaf samples is 0.328 and 0.147, for sequences having 200 and 300 nucleotides, respectively. Proposition 1 guarantees that the leaf \(\Omega _{11}\) is 0.90-distinguished, i.e. containing a distinguished structure with ratio at least 0.90 for sequences of length 200, and 0.95-distinguished for sequences of length 300.

Fig. 9
figure 9

The average entropy of sub-samples \(H(\Omega _{t})\) on the t-th level. We randomly generate 1000 sequences of length 200, and sample \(2^{10}\) structures together with a target structure s for each sequence

For q-Boltzmann samples \(\Omega ^q\) of structures having signature distance to the target s at most qn, the small entropy of the leaf and the high ratio of the distinguished structure are robust over a range of q-values, see Fig. 10. We also observe that, for longer sequences, the entropy is smaller, and therefore the ratio of the distinguished structure is higher.

Fig. 10
figure 10

The structural entropy \(H(\Omega ^q_{11})\) of the leaf sub-samples for different q-values. We randomly generate 1000 sequences of length 100, 200 and 300. For each sequence, we then generate a q-Boltzmann sample \(\Omega ^q\) of \(2^{10}\) structures together with a target s. The red dashed line denotes q-samples having \(q=0.05\), which is tantamount to Boltzmann samples \(\Omega _{\text {probe}}\) incorporating the probing data via pseudo-energies

Target identification

Any leaf of the ensemble tree exhibiting a structural entropy less than one, contains, by Proposition 1, a distinguished structure. Successive queries produce a unique, distinguished leaf, \(\Omega ^*\) which, with high probability, contains structures that are compatible with the queries. Let \(s^*\) be the distinguished structure in \(\Omega ^*\), and s denote the target.

In this section, we shall analyze this probability, \({\mathbb {P}}(s\in \Omega ^*)\), as well as \({\mathbb {P}}(s^*=s)\) and \({\mathbb {P}}(s^*=s\mid s\in \Omega ^*)\), see Table 1. For the path identification to the leaf \(\Omega ^*\), we consider the error rates \(e_0=0.05\) and \(e_1=0.01\) computed in “Path identification” section.

As detailed in “Path identification” section, these probabilities depend on the error rates \(e_0\) and \(e_1\), and since these errors occur independently, we derive \({\mathbb {P}}(s\in \Omega ^* )=(1-e_0)^{l_0} (1-e_1)^{l_1}\), where \(l_0\) and \(l_1\) denote the number of No-/Yes-answers to queried base pairs along the path, respectively. Figure 11 displays the distribution of \(l_1\). We observe that \(l_1\) has a mean around 5, i.e., the probabilities of queried base pairs being confirmed and being rejected are roughly equal. For \(l_0=l_1=5\), we have a theoretical estimate \({\mathbb {P}}(s\in \Omega ^* )\approx 0.736\). In Fig. 12 we present that \({\mathbb {P}}(s\in \Omega ^* )\) decreases as the error rate \(e_0\) increases, for fixed \(e_1=0.01\).

Fig. 11
figure 11

The distributions of \(l_1\), the number of queried base pairs on the path that are confirmed by the target structure. We generate unrestricted Boltzmann samples for random sequences of different lengths

Fig. 12
figure 12

The probability \({\mathbb {P}}(s\in \Omega ^* )\) as a function of the error rate \(e_0\), for fixed \(e_1=0.01\) and \(l_0=l_1=5\)

For (unrestricted) Boltzmann samples generated from random sequences, we present the probability \({\mathbb {P}}(s\in \Omega ^*)\) of the leaf containing the target is greater than \(74\%\), which agrees with the above theoretical estimate. Note that this amounts to having no probing data as a constraint for the sampled structures, a worst case scenario, so to speak.

Table 1 Key observables

Furthermore, the probability that the distinguished structure is identical to the target is approximately unchanged, see Table 2. \({\mathbb {P}}(s^*=s\mid s\in \Omega ^*)\) indicates, that once we are in the correct leaf, the chance of correctly identifying the target increases to \(94\%\) for sequences of length 300. Accordingly, the key factor is the correct identification of the leaf \(\Omega ^*\).

Table 2 Target identification: we randomly generate 1000 sequences of length n and Boltzmann sample \(2^{10}\) structures together with a target structure s for each sequence

For q-Boltzmann samples \(\Omega ^q\) filtered by signature distance \(\le q n\) we observe the following: the probability \({\mathbb {P}}(s\in \Omega ^{*} )\) of the leaf to contain the target is greater than \(70\%\) is robust over a range of q-values, see Fig. 13. As expected, as q increases, the probability of the target being in the correct leaf decreases, due to the fact that the q-samples become less constraint by the probing data.

In particular, we observe that, for \(q=0.05\) and sequences of length 300, the probability of the ensemble tree identifying the correct leaf is greater than \(90\%\), see Fig. 13 (red dashed line). As the Boltzmann ensembles incorporation of probing data via pseudo-energies result in a q-value of 0.05, this translates into \({\mathbb {P}}(s\in \Omega ^{*})\ge 90\%\) for such ensembles generated by such restricted Boltzmann samplers for sequences of length 300.

We demonstrate that the ensemble tree localizing the target with high fidelity is robust, across samples of sequences having various lengths and different signature filtration q. Figure 14 (LHS) shows that the ensemble tree for longer sequences has a higher chance of identifying the target. Once we are in the correct leaf, the chance of correctly distinguishing the target significantly increases, from around \(75\%\) to over \(94\%\) in the case of sequences having 200 nucleotides, see Fig. 14 (RHS).

Fig. 13
figure 13

The probability \({\mathbb {P}}(s\in \Omega ^{*} )\) of being in the correct leaf for different q-values. We use the same sequences and q-Boltzmann samples as described in Fig. 10. Error bars show the standard deviations for random sequence samplings in the corresponding cases. The red dashed line denotes q-samples having \(q=0.05\), which is tantamount to Boltzmann samples \(\Omega _{\text {probe}}\) incorporating the probing data via pseudo-energies

Fig. 14
figure 14

The probabilities \({\mathbb {P}}(s^*=s)\) (LHS) and \({\mathbb {P}}(s^*=s\mid s\in \Omega ^*)\) (RHS) of correctly identifying the target, either in general or conditioning on being in the correct leaf. We use the same sequences and q-Boltzmann samples as described in Fig. 10. Error bars show the standard deviations for random sequence samplings in the corresponding cases. The red dashed line denotes q-samples having \(q=0.05\), which is tantamount to Boltzmann samples \(\Omega _{\text {probe}}\) incorporating the probing data via pseudo-energies

As mentioned above, the key is the correct identification of the leaf containing the target, and its distinguished structure to coincide with the latter. These events are quantified via \({\mathbb {P}}(s\in \Omega ^{*} )\) and \({\mathbb {P}}(s^*=s)\), which depend on the error rates \(e_0\) and \(e_1\).

These error rates can be reduced by asking the same query repeatedly. In our Rényi-Ulam game, repeating the same query is tantamount to performing the same experiment multiple times. It is reasonable to assume that experiments are performed independently and thus errors occur randomly. Intuitively, repeated experiments reduce errors originated from the noisy nature of experimental data. Utilizing Bayesian analysis, we show that, if we get the same answer to the query twice, the error rates would become significantly smaller, for example, \(e_0^{[2]} =0.003\) and \(e_1^{[2]} =0.00005\), see Appendix H.

In principle, we can reduce the error rates by repeating the same query k times. The error rates would approach to 0 as k grows to infinity. In this case, \({\mathbb {P}}(s\in \Omega ^{*} ) \approx 1\), i.e. the leaf always contains the target. The fidelity of the distinguished structure \({\mathbb {P}}(s^{*}=s ) \) increases from 70 to \(94\%\) for sequences of length 300.

Robustness

A significant advantage of our approach is the robustness of target identification across Boltzmann samples of various sizes and different GC-contents. For sizes, we deliberately change the number N of sampled structures ranging from \(2^9\) to \(2^{11}\). Accordingly, the maximum level L of the ensemble tree varies, i.e., it grows at a logarithmic scale \(L=\log _2 N+1\). For GC-contents, we utilize sequences with different nucleotide compositions to generate Boltzmann samples. It is believed that medium or low GC-content offers greater transcription efficiency, while high GC-content provides better structural stability [33]. We thus consider GC-rich and AU-rich random sequences. For each variant sample, we compute the distinguished structure \(s^*\) in the leaf \(\Omega ^*\) via successive L queries, and presents the probabilities of the distinguished structure being identical to the target, see Tables 3 and 4.

In Table 3, we demonstrate that the ensemble tree localizing the target with high fidelity is robust, across unrestricted samples of various sizes. Additionally, Table 3 shows that the ensemble tree for longer sequences has a higher chance of identifying the target, see the probabilities displayed in italics. We also observe that the probability of being in the distinguished leaf, \({\mathbb {P}}(s\in \Omega ^*)\), slightly decreases, as the sample size increases. This can be improved by repeating the same query multiple times as shown in “Target identification” section.

Table 4 shows the robustness of target identification across samples of different GC-contents. This indicates that the effectiveness of our approach remains unaffected by RNA sequences with various GC-contents.

Table 3 The robustness of target identification across samples of various sizes
Table 4 The robustness of target identification across samples of different GC-contents

Performance comparison

Here we apply our approach to natural RNAs, and compare the performance with the RNA structure modeling method developed by Hajdin et al. [8]. First, we use the data set of 18 RNAs with published SHAPE profiles and accepted secondary structures [8]. This data set includes tRNAs, ribosomal RNAs, riboswitches, and viruses. RNA lengths vary from 34 to 530 nucleotides, see Table 5. We consider the accepted secondary structure excluding pseudoknots as the target. Specifically, we reduce a pseudoknot by removing the helix having the minimum size in the pseudoknot. Then, for each sequence, we incorporate chemical probing data as pseudo energies [7] and generate a Boltzmann sample \(\Omega _{\text {probe}}\) of \(2^{10}\) structures, see “The Boltzmann ensemble” section. We compute the ensemble tree for each sample \(\Omega _{\text {probe}}\), and identify the distinguished leaf \(\Omega ^*\) via successive base-pair queries on the target. For each base-pair query, we consider the error rates \(e_0=0.05\) and \(e_1=0.01\) computed in “Path identification” section. We output the distinguished structure \(s^*\) in the leaf as the predicted structure.

Here we drop the assumption that the target is always in the sample. We consider the model-agnostic property of our approach, i.e., whether it guarantees to find the “best” structure in the sample even if the correct target is not contained. Also, we point out that our framework does not require a priori knowledge of the target structure. However, because we are in lack of probing data on fragments to confirm or reject certain base-pairs using modularity, we need to utilize knowledge of the target to answer the queries.

Nevertheless, as a proof of concept, we present the computational results of our approach applying to the data set. To compare with [8], we compute three measures of performance: sensitivities, positive predictive value (PPV), and accuracy. We show that the accuracy of our method is, on average, 5 percentage points higher than that of [8]. Moreover, the improvement on the target identification accuracy is robust, across sequences of different types and various lengths. Although the sample \(\Omega _{\text {probe}}\) does not contain the target structure for sequences longer than 76, the results demonstrate that our approach is capable of identifying the “best” structure, which is defined as the one in the sample having the smallest base-pair distance to the target.

Table 5 Target identification results for 18 test sequences with SHAPE profiles from [8]

Discussion

In our framework, the key factor is the correct identification of the leaf that contains the target. Figure 15 displays the average base-pair distances \(d_{\text {bp}}(s,\Omega _{t})\)Footnote 2 between the target structure s and the t-th sub-sample \(\Omega _{t}\) on the path. We contrast three scenarios, first the expectation being taken over all ensemble trees (blue), the set of ensemble trees in which the leaf containing the target is identified (green) and its complement (orange). We here present that the correct identification of the leaf containing the target significantly reduces the distance between the target and the sub-samples.

Fig. 15
figure 15

The average base-pair distance \(d_{\text {bp}}(s,\Omega _{t})\) between the target s and the sub-sample \(\Omega _{t}\) on the path. The expectation is taken over all ensemble trees (blue), the set of ensemble trees in which the leaf containing the target is identified (green) and its complement (orange). The computation is based on the Boltzmann samples of sequences of length 300

Our framework is based on two assumptions. The first is sampling from the Boltzmann ensemble of structures. This assumption is important, as for an arbitrary sample, the splittings could be highly unbalanced and the leaf of the ensemble tree does not always contain a distinguished structure. By quantifying the distinguished structure via the flow of entropies of sub-samples on the path, we contrast three classes of samples, the first being a Boltzmann sample (B-sample), the second a uniform sample (U-sample) and the third an E-sample,Footnote 3 see Fig. 16. We present that, in a Boltzmann sample, the entropies of sub-samples on the t-th level decrease much more sharply than those in the latter two classes, see Fig. 16 (LHS). In particular, the latter two produce leaves exhibiting an average entropy greater than 1, i.e. not containing a distinguished structure. As proved in Proposition 2, the entropy reduction equals to the entropy of the queried base pair. Figure 16 (RHS) explains the reason for the significant reduction, that is, the maximum entropy base pairs in Boltzmann samples have entropy close to 1 on each level, implying that the bit queries split the ensemble roughly in half each time. The latter two types of samples do not exhibit this phenomenon. In upcoming work, we shall investigate this phenomenon via quantifying how the uncertainty or entropy of the ensemble is distributed in the bit queries.

Fig. 16
figure 16

The average entropy of sub-samples \(H(\Omega _{t})\) (LHS) and queried base pairs \(H(X_{t})\) (RHS) on the t-th level of the ensemble tree. We contrast the ensemble trees obtained from a Boltzmann sample (B, blue), a uniform sample (U, orange), or an E-sample (E, green), which is comprised of \(2^{10}\) distinct structures, each containing only one base pair. For the former two types of samples, we randomly generate 1000 sequences of length 200. For each sequence, we sample \(2^{10}\) structures together with a target structure s, according to the Boltzmann or uniform distributions

The second assumption is that the target is contained in the sample. This assumption can be validated by generating samples of larger size, and checking whether or not the distinguished structure is reproducible. We would remark that, even though the probability of identifying the target \({\mathbb {P}}(s^*=s)\) slightly decreases for larger samples (Table 3), we can significantly improve \({\mathbb {P}}(s^*=s)\) by performing the same experiment multiple times, see “Target identification” and Appendix H.

Accordingly, the probability and entropy of a base pair is calculated in the context of the entire ensemble, and thus the ensemble tree together with maximum entropy base pairs. [32] show that the structural entropy of the entire Boltzmann ensemble is asymptotically linear in n, i.e. \(H(\Omega _{\text {entire}})\approx 0.07 n\). Since each queried base pair reduces the entropy by approximately 1 and the reduction is additive by construction, the ensemble tree would require approximately 0.07n queries to identify a leaf that has entropy smaller than 1 and contains a distinguished structure.

For a sample of RNA pseudoknotted structures, the ensemble tree in our framework can still be computed. However, the structure modularity no longer holds in the pseudoknot case. The reason is that a pseudoknot loop could intersect in more than one base pair with other loops, see Fig. 17 (RHS). The fragmentation with respect to a base pair involved in a pseudoknot could affect several loops, each contributing to the free energy. The change of loop-based energy could lead to splits folding into a different configuration compared to the full transcript. Nevertheless, it would be interesting to find out other experimental methods to facilitate our framework for RNA pseudoknotted structures.

Conclusion

In this paper we propose to enhance the method of identifying the target structure based on RNA probing data. To facilitate this we introduce the framework of ensemble trees in which a sample derived from the partition function of structures is recursively split via queries using information theory. Each query is answered based on either RNA folding data in combination with chemical probing, employing modularity of RNA structures, or, alternatively, directly using experimental methods [29, 30]. The former type of inference can be viewed as a kind of localization of probing data, relating local to global data by means of structural modularity. We show that within this framework it is possible to identify the target with high fidelity and that this identification requires a small number of base pairs to be queried. In particular we present that, for the Boltzmann ensembles incorporating probing data via pseudo-energies, the probability of the ensemble tree identifying the correct leaf that contains the target is greater than \(90\%\), see “Target identification” section.