## Abstract

The study of native motifs of RNA secondary structures helps us better understand the formation and, ultimately, the functions of these molecules. Commonly known structural motifs include helices, hairpin loops, bulges, interior loops, exterior loops and multiloops. However, enumerative results and generating algorithms that take into account the joint distribution of these motifs are scarce. In this paper, we present progress on deriving such distributions, employing a tree-bijection of RNA secondary structures obtained by Schmitt and Waterman and a novel rake decomposition of plane trees. The key feature of the latter is that the derived components directly encode the motifs of the pseudoknot-free RNA secondary structures associated with the plane trees. As an application, we present an algorithm (*RakeSamp*) that generates uniformly random secondary structures without pseudoknots satisfying fine motif specifications on the length and degree of various types of loops as well as helices.

## Data Availability

The software *RakeSamp* and its source code are freely available on GitHub: https://github.com/RickyXFChen/RakeSamp.

## References

Chen RXF (2019) A new bijection between RNA secondary structures and plane trees and its consequences. Electron J Combin 26(4):4–48

Chen WYC (1990) A general bijective algorithm for trees. Proc Natl Acad Sci USA 87:9635–9639

Clote P (2006) Combinatorics of saturated secondary structures of RNA. J Comp Biol 13:1640–1657

Clote P, Ponty Y, Steyaert JM (2012) Expected distance between terminal nucleotides of RNA secondary structures. J Math Biol 65:581–599

Došlić T, Svrtan D, Veljan D (2004) Enumerative aspects of secondary structures. Discrete Math 285:67–82

Ding Y, Lawrence CE (2003) A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 31:7280–7301

Duchon P, Flajolet P, Louchard G, Schaeffer G (2004) Boltzmann samplers for the random generation of combinatorial structures. Combin Probab Comput 13:577–625

Fontana W, Konings D, Stadler PF, Schuster P (2004) Statistics of RNA secondary structures. Biopolymers 33:1389–1404

Hofacker IL, Schuster P, Stadler PF (1998) Combinatorics of RNA secondary structures. Discrete Appl Math 88:207–237

Hofacker IL (2003) Vienna RNA secondary structure server. Nucleic Acids Res 31:3429–3431

Heitsch C, Poznanović S (2014) Combinatorial insights into RNA secondary structure. In: Jonoska N, Saito M (eds) Discrete and topological models in molecular biology. Springer, pp 145–166

Jühling F, Mörl M, Hartmann RK, Sprinzl M, Stadler PF, Pütz J (2009) tRNAdb 2009: compilation of tRNA sequences and tRNA genes. Nucleic Acids Res 37:D159–D162

Knuth D (1997) The Art of Computer Programming, Vol. 2 (3rd Ed.) Addison-Wesley Longman: Boston

Poznanović S, Heitsch C (2014) Asymptotic distribution of motifs in a stochastic context-free grammar model of RNA folding. J Math Biol 69:1743–1772

Han HSW, Reidys CM (2012) The \(5^{\prime }\)-\(3^{\prime }\) distance of RNA secondary structures. J Comp Biol 19:867–878

Klazar M (1998) On trees and noncrossing partitions. Discrete Appl Math 82:263–269

Lorenz W, Ponty Y, Clote P (2008) Asymptotics of RNA shapes. J Comp Biol 15:31–63

Mathews DH, Sabina J, Zuker M, Turner DH (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 288:911–940

Nebel ME (2003) Combinatorial properties of RNA secondary structures. J Comp Biol 9(3):541–574

Nebel ME, Scheid A, Weinberg F (2011) Random generation of RNA secondary structures according to native distributions. Algorithms Mol Biol 6:24

Ponty Y (2008) Efficient sampling of RNA secondary structures from the Boltzmann ensemble of low-energy: the boustrophedon method. J Math Biol 56:107–127

Schmitt WR, Waterman MS (1994) Linear trees and RNA secondary structure. Discrete Appl Math 51(3):317–323

Smith TF, Waterman MS (1978) RNA secondary structure. Math Biol 42:31–49

Stein PR, Waterman MS (1979) On some new sequences generalizing the Catalan and Motzkin numbers. Discrete Math 26:261–272

Simion R (2000) Noncrossing partitions. Discrete Math 217:367–409

Sloma MF, Mathews DH (2016) Exact calculation of loop formation probability identifies folding motifs in RNA secondary structures. RNA 22:1808–1818

Sato K, Akiyama M, Sakakibara Y (2021) RNA secondary structure prediction using deep learning with thermodynamic integration. Nat Comm 12:941

Staple DW, Butcher SE (2005) Pseudoknots: RNA Structures with Diverse Functions. PLOS Biol 3(6):e213

Tuerk C, MacDougal S, Gold L (1992) RNA pseudoknots that inhibit human immunodeficiency virus type 1 reverse transcriptase. Proc Natl Acad Sci USA 89(15):6988–6992

Waterman MS (1978) Secondary structure of single-stranded nucleic acids. In: Rota G-C (ed) Studies on foundations and combinatorics, Advances in mathematics supplementary studies. Academic Press, New York, pp 167–212

Waterman MS (1979) Combinatorics of RNA hairpins and cloverleaves. Stud Appl Math 60:91–98

Zuker M (1989) On finding all suboptimal foldings of an RNA molecule. Science 244:48–52

Zuker M, Sankoff D (1984) RNA secondary structures and their prediction. Bull Math Bio 46:591–621

Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31:3406–3415

## Acknowledgements

We are grateful to the anonymous referees for their valuable comments and suggestions, which improved the presentation of the work. The first author also acknowledges support from the Yellow Mountain Distinguished Scholar Program at the Hefei University of Technology.

## Ethics declarations

### Conflicts of interest

None.

### Funding

The first author was supported by the Anhui Provincial Natural Science Foundation of China (No. 2208085MA02) and the Overseas Returnee Support Project on Innovation and Entrepreneurship of Anhui Province (No. 11190-46252022001).

### Availability of data and material

Not applicable.

### Code availability

*RakeSamp* is available on GitHub: https://github.com/RickyXFChen/RakeSamp.

## Appendix Proofs

### Proof of Lemma 2.1

Note that in a finite plane tree, there always exists an internal vertex *u* whose children are all leaves. Starting with *u* and traveling along the unique path from *u* to the root (which could be *u* itself), the first encountered vertex that has siblings, *v*, is a companion that determines a rake. In case no such *v* exists, the root represents the companion that determines a rake.

It is clear that deleting all descendants of *v* and their incident edges does not change whether any remaining vertex is a companion. It may, however, change whether such a companion determines a rake after the deletion. Furthermore, *v* itself is a leaf in \(T'\), and by definition, none of the descendants of *v* are companions in *T*, whence we have \(m-1\) companions in \(T'\).
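The search described in this proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is given as a hypothetical dictionary `children` mapping each vertex to its ordered list of children, the function name `find_rake_companion` is ours, and the tree is assumed to have at least one edge.

```python
def find_rake_companion(children, root):
    """Return (u, v) as in the proof of Lemma 2.1.

    u is an internal vertex whose children are all leaves; v is the first
    vertex with siblings on the path from u to the root (or the root itself).
    `children` maps a vertex to the ordered list of its children; leaves may
    be absent from the dict. Assumes the tree has at least one edge.
    """
    parent = {c: p for p, kids in children.items() for c in kids}

    def is_leaf(x):
        return not children.get(x)

    # Descend through internal children until all children of u are leaves.
    u = root
    while True:
        internal_kids = [c for c in children.get(u, []) if not is_leaf(c)]
        if not internal_kids:
            break
        u = internal_kids[0]

    # Walk towards the root until the current vertex has a sibling.
    v = u
    while v != root and len(children[parent[v]]) == 1:
        v = parent[v]
    return u, v


# Root 0 has children 1 and 2; vertex 1 has the single child 3;
# vertex 3 has the two leaves 4 and 5.
tree = {0: [1, 2], 1: [3], 3: [4, 5]}
print(find_rake_companion(tree, 0))  # (3, 1)
```

In the example, vertex 3 has only leaves as children but no sibling, so the walk moves up to vertex 1, which has the sibling 2.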

### Proof of Lemma 3.1

Taking the auxiliary arc \((0, 2b+k+1)\) into consideration, there is at least one arc covering *u*. Then, by construction of the Schmitt–Waterman bijection, *u* is a child of the innermost arc *v* covering *u*. Since *u* is the outermost arc of a helix and is not \((1, 2b+k)\), *v* must directly cover some other isolated bases or arcs, whence *u* is not the only child of *v*. Moreover, only isolated bases are mapped to leaves, thus *u* is mapped to an internal vertex. Therefore, *u* corresponds to a companion. If \((1, 2b+k)\) is a base pair, it is clearly mapped to the unique child of the vertex \((0, 2b+k+1)\). The remaining statements follow directly from the Schmitt–Waterman bijection.

If \((1, 2b+k)\) is a base pair, then it is the outermost arc of a helix. As a consequence of the above discussion, the other \(h-1\) helices determine \(h-1\) companions that are not the root. Taking account of the root, there are *h* companions in total. Analogously, if \((1, 2b+k)\) is not a pair, there are \(h+1\) companions, and the proof follows.
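The companion count can be checked on small examples with the following Python sketch. It makes several assumptions that are not part of the text: structures are given in dot-bracket notation, a companion is taken to be the root or an internal vertex with at least one sibling (as the argument above suggests), a helix is a maximal run of stacked arcs, and all function names are hypothetical.

```python
def tree_from_dot_bracket(s):
    """Plane tree as a list of child lists; vertex 0 is the auxiliary arc."""
    children = [[]]
    stack = [0]
    for ch in s:
        if ch == ")":
            stack.pop()
            continue
        children.append([])
        v = len(children) - 1
        children[stack[-1]].append(v)  # child of the innermost covering arc
        if ch == "(":
            stack.append(v)
    return children


def arcs_of(s):
    """Set of base pairs (i, j) with 1-based positions."""
    stack, arcs = [], set()
    for pos, ch in enumerate(s, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            arcs.add((stack.pop(), pos))
    return arcs


def count_helices(arcs):
    # An arc is the outermost arc of a helix if it is not stacked inside another arc.
    return sum(1 for (i, j) in arcs if (i - 1, j + 1) not in arcs)


def count_companions(children):
    # The root, plus every internal vertex that has at least one sibling.
    count = 1
    for kids in children:
        if len(kids) >= 2:
            count += sum(1 for c in kids if children[c])
    return count


def check_lemma(s):
    arcs = arcs_of(s)
    h = count_helices(arcs)
    expected = h if (1, len(s)) in arcs else h + 1
    return count_companions(tree_from_dot_bracket(s)) == expected


print(check_lemma("((..))."), check_lemma("(((...))..(...))"))  # True True
```

The first structure has one helix and an unpaired last base, giving two companions; the second has three helices with the first and last bases paired, giving three.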

### Proof of Lemma 3.3

For a bulge loop of length *q*, by definition, the arc *v* of the bulge directly covers *q* isolated bases and an arc which is either to the left or right of all *q* isolated bases. Note that the corresponding vertices covered by *v* are leaves in the corresponding rake and their left-to-right order stays the same as in the secondary structure.

We next consider the exterior loop. By definition, the number of arcs directly covered by \((0, 2b+k+1)\) is \(e_d-1\) and the number of directly covered isolated bases is exactly \(e_l\). If \((1, 2b+k)\) is not a pair, then, according to Lemma 3.1, the arcs directly covered by \((0, 2b+k+1)\) are mapped to companions that will give rise to marked leaves in the corresponding rake rooted on \(b+k+1\) in view of Lemma 3.2. The isolated bases in the exterior loop are mapped to the unmarked leaves of the rake. If \((1, 2b+k)\) is a base pair, then \(e_d=2\) and \(e_l=0\). In this case, the helix having \((1, 2b+k)\) as the outermost arc and the associated loop together with \((0, 2b+k+1)\) give rise to the rake rooted on \(b+k+1\), and no rake is associated with the exterior loop. The remaining statements follow analogously.
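As a small illustration of the exterior-loop parameters used above, the following Python sketch reads \(e_d\) and \(e_l\) off a structure. It assumes dot-bracket input, and the function name is hypothetical; the degree is taken to be one more than the number of arcs directly covered by the auxiliary arc, as stated in the proof.

```python
def exterior_loop_degree_and_length(s):
    """Return (e_d, e_l) for a dot-bracket string s.

    e_d - 1 counts the arcs directly covered by the auxiliary arc,
    e_l counts the isolated bases directly covered by it.
    """
    depth = 0
    top_level_arcs = 0
    top_level_bases = 0
    for ch in s:
        if ch == "(":
            if depth == 0:
                top_level_arcs += 1
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0:
            top_level_bases += 1
    return top_level_arcs + 1, top_level_bases


print(exterior_loop_degree_and_length("((..))"))        # (2, 0)
print(exterior_loop_degree_and_length("(...)..(...)"))  # (3, 2)
```

The first call is the case in which \((1, 2b+k)\) is a base pair, so that \(e_d=2\) and \(e_l=0\).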

### Proof of Theorem 3.4

Note that the length of the RNA secondary structures under consideration is \(2b+k\). Let

We first consider the case that \((1,2b+k)\) is not a base pair, i.e., \(e_d>2\), or, \(e_d=2\) and \(e_l>0\). As discussed above, we turn to enumerating the corresponding labelled plane trees on \([b+k+1]\) with \(b+k+1\) as the root. In view of Theorem 2.2 and Lemma 3.1, the corresponding labelled plane trees for the RNA secondary structures in this case can be decomposed into \(h+1\) rakes, where the rake rooted on \(b+k+1\) corresponds to the exterior loop and the label \((b+k+1+h)^*\) is a leaf there. Based on Lemma 3.3, we enumerate the corresponding forests of rakes according to the following sequential construction.

(I) Determine the number of ways for constructing leaves of the rakes corresponding to hairpin loops. We assume those labelled rakes are arranged linearly by increasing length of the corresponding hairpin loops, and those of the same length by increasing minimum unmarked leaf. This is done as follows: choose *P* unmarked labels out of \([b+k]\) in \({b+k \atopwithdelims ()P}\) different ways, and arrange them linearly in *P*! ways, then cut the obtained sequence into segments such that the first \(p_1\) segments have length one, the next \(p_2\) segments have length two, etc. Here, a segment gives the leaves of a hairpin-rake. However, the respective minimum unmarked leaves in the \(p_i\) segments of the same length *i* could appear in any relative order. Thus, we next need to divide by \(\prod _{i>0} p_i !\) so that only the arrangements in increasing order are counted.

(II) Determine the number of ways for constructing leaves of the rakes corresponding to bulges. This can be first done in

different ways to place the unmarked leaves. Next, we need to place a marked leaf other than \((b+k+1+h)^*\) either to the left or to the right of each segment of unmarked leaves. This can be achieved in \({h-1 \atopwithdelims ()g} 2^g g!\) different ways.

(III) Determine the leaves of the rakes corresponding to interior loops. This is obtained analogously to the case of bulges, with the exception that for an interior loop of length *i*, there are \(i-1\) possible spaces between unmarked leaves to place a marked leaf. Hence, we arrive at

(IV) Determine the number of ways for constructing leaves of the rakes corresponding to multiloops. Suppose those rakes are arranged linearly by increasing degree and then by increasing length of the corresponding multiloops, and those of the same length and degree are arranged according to the minimum marked label in increasing order. The enumeration is analogous, with the following differences: the length of a multiloop may be zero, and the marked leaves could be at any positions relative to the unmarked ones. Accordingly, we have

(V) Determine the number of ways for constructing leaves of the rake corresponding to the exterior loop. Recall that there is exactly one rake corresponding to the exterior loop. This can be done as follows: pick \(e_l\) unused unmarked labels from \([b+k]\) and arrange them together with the remaining unused marked labels linearly. This results in the multiplicity

(VI) Determine the number of ways for constructing stems for all rakes. Note that the stem of the rake corresponding to any loop except the exterior loop could have any size respecting the length distribution of the helices. This is equivalent to the number of ways of first arranging the remaining unused (unmarked) labels linearly and then dividing the resulting sequence into segments with the lengths respecting the length distribution of the helices, given by

Accordingly, the total number of distinct forests of rakes is given by

Dividing the last number by \((b+k)!\) and expanding the involved binomial coefficients, i.e., using \({m\atopwithdelims ()n}=\frac{m!}{n!(m-n)!}\), we observe substantial cancellation. For example, the first line of the above four-line expression (after dividing by \((b+k)!\)) becomes

With this, we eventually obtain the desired formula.
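Two of the factors appearing in the construction above can be sanity-checked by brute force on small inputs. The Python sketch below is an illustration only. It assumes that \(P=\sum_{i>0} i\,p_i\) (inferred from the cutting procedure in step (I)), checks the hairpin factor \({b+k \atopwithdelims ()P}\,P!/\prod_{i>0} p_i!\), and checks the factor \({h-1 \atopwithdelims ()g} 2^g g!\) for placing marked leaves in step (II). All function names are hypothetical.

```python
from itertools import permutations, product
from math import comb, factorial, prod


def hairpin_leaf_count(n, p):
    """Closed form from step (I); n = b + k, p maps length i to p_i."""
    P = sum(i * m for i, m in p.items())
    return comb(n, P) * factorial(P) // prod(factorial(m) for m in p.values())


def hairpin_leaf_brute(n, p):
    """Count unordered families of disjoint leaf sequences directly."""
    P = sum(i * m for i, m in p.items())
    lengths = [i for i in sorted(p) for _ in range(p[i])]
    seen = set()
    for seq in permutations(range(1, n + 1), P):
        pos, segments = 0, []
        for length in lengths:
            segments.append(seq[pos:pos + length])
            pos += length
        # A frozenset forgets the order among segments of equal length.
        seen.add(frozenset(segments))
    return len(seen)


def bulge_marked_count(h, g):
    """Closed form from step (II) for placing the marked leaves."""
    return comb(h - 1, g) * 2**g * factorial(g)


def bulge_marked_brute(h, g):
    """Assign distinct marked labels to g ordered segments, each left or right."""
    marked = range(h - 1)
    return sum(1 for _ in permutations(marked, g)
               for _ in product("LR", repeat=g))


print(hairpin_leaf_count(5, {1: 2, 2: 1}), hairpin_leaf_brute(5, {1: 2, 2: 1}))  # 60 60
print(bulge_marked_count(4, 2), bulge_marked_brute(4, 2))                        # 24 24
```

The brute-force versions enumerate all arrangements explicitly, so they are only practical for very small values of \(b+k\) and \(h\).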

It remains to consider the case of \((1,2b+k)\) being a pair, i.e., \(e_l=0\) and \(e_d=2\). Then, the corresponding forests consist of *h* rakes, and no rake is associated with the exterior loop. If there is only one helix (and hence a single loop, which is a hairpin loop), i.e., \(h=1\) and \(p=1\), then there is only one rake in the corresponding forest and no marked label, and the desired number is one. Otherwise, we analogously enumerate the corresponding forests of rakes according to the following construction.

(a) Determine the number of ways for constructing leaves of the rakes corresponding to hairpin loops. This is analogous to the computation of hairpin loops in the previous case.

(b) Determine the number of ways for constructing leaves of the rakes corresponding to bulges. Note that we have only \(h-1\) marked leaves in total. In contrast to the previous case, however, the largest marked label \((b+k+1+h-1)^*\) may be contained in a rake that is associated with a bulge. Accordingly, the number of ways to construct bulge-rakes is given by

(c) Determine the number of ways for constructing leaves of the rakes corresponding to interior loops and multiloops. This is analogous to the previous case.

(d) Determine the number of ways for constructing stems for all rakes. The stem of the rake corresponding to any loop could have any length respecting the length distribution of the helices, with the exception that the length of one helix increases by one due to the auxiliary arc \((0,2b+k+1)\). This is equivalent to the number of ways of first arranging the remaining unused unmarked labels (other than \(b+k+1\)) linearly and then cutting the resulting sequence into segments with the lengths respecting the original length distribution of the helices, and finally associating \(b+k+1\) (as the root) to the rake that contains \((b+k+1+h-1)^*\), which is given by

As a result, the total number of distinct forests of rakes for this case is given by

Dividing the last number by \((b+k)!\) and simplifying (again with substantial cancellation) yields a formula that agrees with the one obtained for the previous case.

### Proof of Theorem 3.5

The proof is similar to that of Theorem 3.4. We first consider the case \(e_d>2\), or, \(e_l>0\) and \(e_d=2\).

(I) Determine the number of ways for constructing stems of the rakes. There is one rake rooted on \(b+k+1\) whose stem has length one. For the remaining *h* rakes, we arrange them according to the minimum elements contained in the stems in increasing order. This can be done by first picking *b* elements out of \([b+k]\) in \({b+k \atopwithdelims ()b}\) possible ways and arranging them linearly in *b*! ways. Next, we dissect the resulting sequence into *h* segments such that each segment has length at least \(\sigma \) in \({h+(b-h\sigma )-1 \atopwithdelims ()b-h\sigma } \) different ways, and finally, we normalize by the factor \(\frac{1}{h!}\).

(II) Determine the number of ways for constructing loop types associated with the rakes. Evidently, by construction the exterior loop is associated with the rake rooted on \(b+k+1\). Among the remaining *h* rakes, we choose *p* of them for hairpin loops, *g* of them for bulges, *t* of them for interior loops, and \(m_j\) of them for multiloops of degree *j*. The number of ways of doing this is given by

(III) Place the marked leaves. The marked label \((b+k+1+h)^*\) is contained in the exterior loop. There are

ways to next place one marked leaf in each bulge and each interior loop, and \(j-1\) marked leaves in each multiloop of degree *j*.

(IV) Place the unmarked leaves. In each hairpin loop, there are at least \(\theta \) unmarked leaves, and in each bulge, there is at least one unmarked leaf either to the left or to the right of the marked leaf. In each interior loop, there exists at least one unmarked leaf on both sides of the marked leaf. Subject to these constraints, the number of distinct placements is given by

As for the exterior loop, the remaining labels are contained in the rake corresponding to the exterior loop, and there are \((e_l+e_d-1)!\) different ways to arrange them.

In conclusion, the number of ways for constructing such forests is given by

and dividing by \((b+k)!\) produces the formula.
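The stem count from step (I) of this construction, \({b+k \atopwithdelims ()b}\, b!\, {h+(b-h\sigma )-1 \atopwithdelims ()b-h\sigma }\,\frac{1}{h!}\), can be checked by brute force for small parameters. The Python sketch below is an illustration only, with hypothetical function names; it assumes \(h\ge 1\) and \(\sigma \ge 1\), so that all stems are nonempty and pairwise distinct.

```python
from itertools import permutations
from math import comb, factorial


def stem_count(n, b, h, sigma):
    """Closed form from step (I); n = b + k. Assumes h >= 1 and sigma >= 1."""
    slack = b - h * sigma  # elements left after giving each stem sigma elements
    if slack < 0:
        return 0
    return comb(n, b) * factorial(b) * comb(h + slack - 1, slack) // factorial(h)


def compositions(total, parts, minimum):
    """Yield all tuples of `parts` integers >= minimum summing to `total`."""
    if parts == 0:
        if total == 0:
            yield ()
        return
    for first in range(minimum, total - minimum * (parts - 1) + 1):
        for rest in compositions(total - first, parts - 1, minimum):
            yield (first,) + rest


def stem_brute(n, b, h, sigma):
    """Count unordered families of h disjoint stems directly."""
    seen = set()
    for seq in permutations(range(1, n + 1), b):
        for comp in compositions(b, h, sigma):
            pos, stems = 0, []
            for length in comp:
                stems.append(seq[pos:pos + length])
                pos += length
            # A frozenset forgets the order of the stems (the 1/h! factor).
            seen.add(frozenset(stems))
    return len(seen)


print(stem_count(4, 3, 2, 1), stem_brute(4, 3, 2, 1))  # 24 24
print(stem_count(5, 4, 2, 2), stem_brute(5, 4, 2, 2))  # 60 60
```

Using a set of stems in the brute-force count plays the role of arranging the rakes by increasing minimum stem element, since each unordered family has exactly one such arrangement.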

Next, we consider the case \(e_d=2\) and \(e_l=0\).

If there is only one helix (hence one loop), then the desired number is obviously one. Otherwise, there are in total \(h>1\) rakes, and \(b+k+1\) and \((b+k+1+h-1)^*\) are contained in the same rake. Analogously, the desired number in this case reads

Simplifying the last expression produces the formula in the theorem, completing the proof.

## About this article

### Cite this article

Chen, R.X.F., Reidys, C.M. & Waterman, M.S. RNA Secondary Structures with Given Motif Specification: Combinatorics and Algorithms.
*Bull Math Biol* **85**, 21 (2023). https://doi.org/10.1007/s11538-023-01128-5

DOI: https://doi.org/10.1007/s11538-023-01128-5