Prediction of RNA Secondary Structure Using Butterfly Optimization Algorithm

Chatterjee, Sajib; Debnath, Rameswar; Biswas, Sujit; Bairagi, Anupam Kumar

doi:10.1007/s44230-024-00062-6

Research Article
Open access
Published: 02 March 2024

Human-Centric Intelligent Systems

Sajib Chatterjee¹, Rameswar Debnath¹, Sujit Biswas² & … Anupam Kumar Bairagi¹ (ORCID: orcid.org/0000-0003-1639-1301)

Abstract

The structure of ribonucleic acid (RNA) is vital to its function within the cell. The ability to predict RNA structure is essential for developing new medications and understanding genetic illnesses, and it is also important in synthetic and computational biology. These functions are directly related to the secondary structure of RNA, and predicting the secondary structure is the most significant step toward determining the tertiary structure. For these reasons, RNA secondary structure prediction is a pressing topic in bioinformatics. In this research, we present a method based on the swarm-based metaheuristic Butterfly Optimization Algorithm (BOA) for predicting the secondary structure of RNA. The main feature of BOA is that it can conduct local and global search simultaneously. To suit the problem, we have redesigned the operators of BOA to perform global and local search operations in different ways. We follow a thermodynamic model to select the stable secondary structure with minimum Gibbs free energy. To predict the minimum free energy value, we also developed an “Optimize” function that searches for a new optimized structure. This function increases prediction efficiency, creates new stable structures, and decreases the time complexity of the global search procedure. We used a public dataset for the prediction experiments. To assess our prediction efficiency, we compared our outcomes with those of existing popular algorithms. The results show that the proposed approach predicts RNA secondary structure better than other state-of-the-art algorithms.

1 Introduction

RNA is one of the biological macromolecules most significant for all known forms of life; its structure is similar to that of DNA but differs in minor ways. RNA plays an important role in protein synthesis, transferring genetic information into cells, DNA replication, managing gene activity during evolution, cellular differentiation, and genetic evolution. Protein synthesis involves three varieties of RNA: ribosomal RNA (rRNA), transfer RNA (tRNA), and messenger RNA (mRNA). mRNA contains the information for protein synthesis, tRNA transports amino acids to ribosomes as input for protein synthesis, and rRNA is the core of a cell’s ribosomes [1]. RNA is a molecule consisting of a long chain of nucleotides, which are classified into four types: adenine (A), uracil (U), guanine (G), and cytosine (C).

RNA has three levels of structure. The primary structure is the string of nucleotides in a straight-line sequence, such as GCCUCAUGGUGGUGGCUGGGGGCAGCCUCAUGGUGGUGGUGGCUGGGG. This structure distinguishes one RNA from another but conveys only limited information about the RNA structure. The secondary structure (2D bonded base pairs) is a folding of the molecule onto itself through hydrogen bonds between the complementary nucleotides C-G, A-U, and G-U. Here (C-G) and (A-U) are the regular canonical base pairs, also called Watson–Crick base pairs, and (G-U) is the less stable non-canonical ‘wobble’ base pair [2]. The folding operation forms both canonical and non-canonical base pairs in the RNA secondary structure [3]. Hence, predicting the RNA secondary structure amounts to predicting all the hydrogen bonds from the primary structure of the sequence. The tertiary structure provides a three-dimensional view of the RNA molecule.

Predicting the secondary structure is widely viewed as the first step toward recognizing the function of an RNA molecule. Researchers have focused on determining the secondary structure of RNA for several decades because understanding hereditary illnesses and discovering new treatments are pressing concerns. It also helps biologists determine the importance of a substance in a cell [4], and the structure of RNA aids in understanding its functionality. Physical methods such as Nuclear Magnetic Resonance (NMR) and X-ray crystallography have been developed to determine the structure of RNA. However, these procedures are complex, time-consuming, and costly [5]. Researchers have therefore concentrated on mathematical and computational tools, and many approaches and algorithms [6,7,8,9,10,11] have recently been developed for the RNA secondary structure problem.

The RNA secondary structure prediction problem is to predict, from a primary RNA sequence, its secondary structure representation. The problem is reported to be NP-hard. The most important factor affecting prediction accuracy is the length of the sequence: accuracy usually drops as the size of the molecule increases [12]. A dynamic programming approach based on free energy minimization has a polynomial complexity of O(n⁶), so using it in practice, especially for long RNA sequences, poses a significant challenge [13]. Moreover, the thermodynamic models employed in dynamic programming to estimate the free energy of an RNA secondary structure are generally accurate only to within a 5–10% margin. This is problematic since many RNA secondary structures lie within 5–10% of the global minimum free energy. Heuristic approaches offer no assurance of locating the structure with the minimum free energy, yet they can be faster and better at handling long RNA sequences. They are also far less constrained than dynamic programming algorithms with respect to the complexity of the underlying energy model. Because the energy model itself is only approximate, it is not necessary to find the exact global minimum free energy; it is sufficient to obtain the most stable structure with a free energy close to the minimum.

In this paper, we present a nature-inspired, swarm-based metaheuristic, the Butterfly Optimization Algorithm (BOA), to investigate the RNA secondary structure. We follow the food-searching behavior of butterflies and perform local and global search operations based on the sensitivity of butterflies to fragrance. These operations help us find a stable structure and an optimum solution. We have designed four operators: separate global search, reverse global search, exchange local search, and marge local search. Separate global search divides each molecule or structure into two structures and injects random elements that help to find the global minimum point. Reverse global search combines local minimum points from two different regions to search for the global minimum structure. Exchange local search exchanges the positions of different monomers within a local region of the search space and works like a mutation operator, creating small changes among the molecules. Marge local search chooses two different structures in a local region and merges the even and odd positions of these two structures to create local minimum points. An “Optimize” function discards extra duplicate points in the structure, which speeds up the search procedure; it optimizes the result of the search operations and decreases the time complexity of the process. In recent years, BOA has been used to solve many optimization problems, and it has given better results than existing algorithms for classification and optimization problems such as feature selection [14], node localization in wireless sensor networks [15], self-adaptive numerical optimization [16], evolving artificial neural networks for data classification [17], global optimization [18], and protein folding optimization [19]. Because of this robust performance on optimization problems, we have chosen BOA for solving the RNA secondary structure prediction problem.
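As a rough illustration of two of the operators described above, the following sketch assumes that a candidate structure is encoded as a list of stem numbers. The encoding, the function names, and the sample lists are assumptions made for this example; they are not taken from the paper's implementation.

```python
import random


def exchange_local_search(structure, rng):
    """Swap two randomly chosen positions, like a small mutation (assumed form)."""
    child = list(structure)
    if len(child) >= 2:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child


def marge_local_search(a, b):
    """Build a child from the even positions of `a` and the odd positions of `b`."""
    n = min(len(a), len(b))
    return [a[k] if k % 2 == 0 else b[k] for k in range(n)]


rng = random.Random(0)          # fixed seed so the example is repeatable
parent_a = [3, 0, 4, 1, 2]      # hypothetical stem numbers
parent_b = [1, 2, 0, 4, 3]
swapped = exchange_local_search(parent_a, rng)
merged = marge_local_search(parent_a, parent_b)
print(swapped, merged)
```

Note that the merged child can contain repeated stem numbers, which is the kind of duplicate the “Optimize” function is described as discarding.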

The major contributions and novelty of this work are summarized as follows:

A novel, efficient approach has been developed using the Butterfly Optimization Algorithm (BOA). BOA has been applied to different NP-hard problems but has not been used for the RNA secondary structure prediction problem.
We have designed both local and global search with four operators: separate, reverse, exchange, and marge. These operators make the search for the best structure of an RNA sequence robust and convenient.
The optimize butterfly function is another novel element. We implement this function to remove extra duplicate stem number(s). If the four operators of BOA produce duplicate stem number(s), optimize butterfly discards the duplicates and produces a valid structure. This function makes BOA more time efficient.
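The duplicate-removal step in the last contribution can be sketched as follows. This is a minimal illustration that assumes a structure is a list of stem numbers and that the first occurrence of each stem is the one to keep; the function name and both assumptions are ours, not the paper's.

```python
def optimize_butterfly(structure):
    """Drop repeated stem numbers, keeping the first occurrence of each."""
    seen = set()
    cleaned = []
    for stem in structure:
        if stem not in seen:
            seen.add(stem)
            cleaned.append(stem)
    return cleaned


# A hypothetical operator output in which stems 4 and 2 appear twice.
print(optimize_butterfly([3, 2, 4, 4, 2]))
```

Using a set for membership checks keeps the pass linear in the length of the structure.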

The paper is laid out as follows: related work on RNA structure prediction is described in Sect. 2. The butterfly optimization algorithm, as applied to the RNA structure prediction problem, is illustrated and described in Sect. 3. The experimental results and a comparison with state-of-the-art methods for RNA secondary structure prediction are shown in Sect. 4. Sect. 5 concludes the paper.

2 Related Works

Many approaches and algorithms have been devised to predict the RNA secondary structure. Dynamic programming, heuristic, and metaheuristic algorithms have all been applied to this NP-hard problem. The most relevant algorithms are reviewed below.

2.1 Dynamic Programming (DP)

Dynamic programming (DP) is a technique that decomposes a big problem into smaller ones. Each small problem is solved, its outcome is stored, and the stored results are then combined recursively to obtain the final result. Tatsuya Akutsu [13] proposed dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. The time complexity of the method increases to O(n⁵) or more when complicated pseudoknots are handled. Another important difficulty is that the method does not identify which class of pseudoknots it should cover and has no established energy function specifically for the loop regions. Zuker [20] created the DP-based method mfold for constructing the RNA secondary structure with the least possible free energy. The dynamic programming algorithm cannot consider the kinetic factors related to easily accessible states in RNA folding. Kengo Sato and Yuki Kato [21] developed a linear partition model that enhances the prediction of secondary structures in long sequences by considering pseudoknots. Despite this advancement, the model's prediction accuracy for crossing base pairs remains suboptimal.
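To make the store-and-reuse idea concrete, here is a small sketch of the classic Nussinov-style recurrence, which maximizes the number of nested base pairs. It is a simpler technique than the energy-based and pseudoknot-aware methods cited above and is shown only as an illustration; the minimum hairpin loop length of 3 is an assumption.

```python
def can_pair(x, y):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return {x, y} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in `seq` (pseudoknot-free)."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] stores the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])        # i or j unpaired
            if can_pair(seq[i], seq[j]):
                best = max(best, dp[i + 1][j - 1] + 1)    # i pairs with j
            for k in range(i + 1, j):                     # split into two parts
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


print(nussinov_max_pairs("GGGAAAUCC"))
```

The three nested loops give O(n³) time, which shows why richer models that also handle pseudoknots become expensive for long sequences.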

2.2 Deep Learning

In [22], the authors introduced a deep learning-based algorithm for predicting RNA secondary structures that integrates thermodynamic principles into its framework. The prediction accuracy of machine learning-based methods such as this one is expected to improve as the volume of training data increases. The authors also identified the absence of prior knowledge about the sequences involved as a significant challenge in predicting secondary structures from single sequences. In [23], the researchers introduced REDfold, an algorithm for predicting RNA secondary structure built around a residual encoder-decoder learning network. Unlike traditional methods that rely on dynamic programming, REDfold uses constrained optimization, which allows it to predict structures beyond the limitations of nested folding patterns. Fu et al. [24] proposed UFold, a fast and accurate deep learning method for RNA secondary structure prediction. However, its prediction accuracy cannot be improved further without access to more training data.

2.3 Heuristic Algorithms

Many bioinformatics problems are, by nature, very difficult to solve optimally in time polynomial in their size, which motivates the use of heuristic algorithms. A heuristic algorithm is an algorithmic scheme that can produce an adequate solution to a problem within polynomial time in practical scenarios, although there is no formal proof that the solution is actually optimal. Heuristic algorithms are typically used when, under the given constraints of a problem, no known method finds an optimal solution.

2.3.1 Genetic Algorithm (GA)

A hybrid framework that combines the genetic algorithm (GA) and simulated annealing (SA) for RNA secondary structure prediction is presented in [25]. The authors built two combinations of these algorithms: in the GA-SA combination, GA performs the global search and SA the local search, while in the SA-GA combination, SA performs the global search and GA the local search. The hybridization of GA and SA performs better than GA or SA alone. The main pitfall of the algorithm is that it can predict only 2-order pseudoknots; for complex pseudoknots it cannot predict an accurate structure.

Based on thermodynamic models, Wiese developed the RnaPredict technique [10]. In the first section of that paper, they assessed the correlation between the number of true positive base pairs and the free energy of a structure in order to evaluate the effectiveness of the Individual Nearest Neighbor Hydrogen Bond (INN-HB) model and the grouping energy-based thermodynamic models. Tong and Cheung [26] suggested a different strategy, GAknot, which uses GA to predict RNA secondary structures with pseudoknots. In addition to making it likely to find MFE structures, GA provides several suboptimal structures and other structures that are more closely related to the native fold. While GA is useful for determining basic energy parameters, it is time-consuming for sequences with many helices.

2.3.2 Tabu Search (TS)

In [27], the authors proposed RNATS, a tabu search-based RNA secondary structure prediction model. It uses two search strategies, intensification and diversification: the former searches the immediate region surrounding the existing solution, and the latter explores previously unexplored territory. Predictions are made using the minimum free energy technique. The authors evaluated the proposed method experimentally on six RNA sequences, and the outcome showed compelling performance.

2.3.3 Simulated Annealing (SA)

Schmitz and Steger [28] were the first to suggest using simulated annealing to identify RNA secondary structures, with a free energy optimization technique in which an iterative model and the breakdown of single base pairs determine the secondary structure. Tsang and Grypma developed a permutation-based RNA structure prediction method [3]. They applied a swap mutation operator with adaptive annealing scheduling and a flip mutation operator with geometrically scheduled simulated annealing. Only sequences with lower energies benefit from this approach, and adaptive scheduling is much more time-consuming.
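The combination of a swap mutation with a geometric cooling schedule can be sketched generically as below. This is a textbook-style simulated annealing loop over a permutation, not the cited authors' method; the toy energy function, the starting temperature, and the cooling rate are all assumptions chosen for illustration.

```python
import math
import random


def anneal(initial, energy, rng, t0=10.0, alpha=0.95, steps=500):
    """Minimize `energy` over permutations using swap moves and geometric cooling."""
    current = list(initial)
    e_cur = energy(current)
    best, e_best = list(current), e_cur
    t = t0
    for _ in range(steps):
        cand = list(current)
        i, j = rng.sample(range(len(cand)), 2)     # swap mutation
        cand[i], cand[j] = cand[j], cand[i]
        e_new = energy(cand)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if e_new <= e_cur or rng.random() < math.exp((e_cur - e_new) / t):
            current, e_cur = cand, e_new
            if e_cur < e_best:
                best, e_best = list(current), e_cur
        t *= alpha                                 # geometric schedule
    return best, e_best


def misplaced(perm):
    """Toy energy (assumption): number of elements not at their own index."""
    return sum(1 for k, v in enumerate(perm) if k != v)


start = [4, 3, 2, 1, 0, 5]
best, e_best = anneal(start, misplaced, random.Random(42))
print(best, e_best)
```

In a real predictor the energy function would be the free energy of the structure decoded from the permutation.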

2.4 Meta-heuristic Algorithms

A meta-heuristic is a problem-independent algorithmic framework that provides near-optimal solutions in polynomial time for problems that exact algorithms fail to solve.

2.4.1 Fruit Fly Optimization Algorithm (FOA)

The authors of [29] used the FOA method to predict RNA secondary structure. For prediction, they designed four FOA-based operators that perform local and global searches randomly. They then selected the resulting RNA secondary structures with the least free energy, calculated using the Gibbs free energy formula.

2.4.2 Particle Swarm Optimization Algorithm (PSO)

This algorithm uses a set-based technique for predicting RNA secondary structure [30]. The method's primary purpose is to increase the number of stems for a specific RNA sequence, after which the minimum free energy is analyzed. The process consists of two stages: the level of each swarm is identified in the initial stage, and the algorithm then moves on to the next stage. The authors also used the Kruskal–Wallis test with post-hoc analysis to test the validity of their hypothesis. Because of the complexity of estimating such structures, they did not experiment on pseudoknotted RNA sequences.

2.4.3 Chemical Reaction Optimization (CRO)

CRO is a population-based meta-heuristic algorithm based on chemical reaction concepts. The authors in [11] proposed a CRO-based algorithm to predict RNA secondary structure from the primary sequence. First, they generated the solution space as a population and chose a probable sequence. Then they randomly decided whether to perform unimolecular or intermolecular collision operations based on the predefined MoleColl value. They designed four operators to perform the CRO-based RNA secondary structure prediction (RSSP) functions.

3 Butterfly Optimization Algorithm

The Butterfly Optimization Algorithm (BOA) is a nature-inspired, population-based meta-heuristic algorithm that mimics the behavior of natural butterflies. It was first introduced by Arora and Singh [31] in 2019. The capacity of a butterfly to find food was the primary inspiration for this algorithm. Butterflies have a very keen sense of smell, allowing them to locate food from great distances and to distinguish between distinct scents within a specific area [29]. The primary mechanism of BOA is foraging, in which butterflies use their sense of smell to find food. In BOA, it is assumed that each butterfly produces a smell of a certain intensity. When a butterfly detects the scent of another butterfly and moves toward it, the phase is treated as global search; when it moves on its own, the butterfly movement is referred to as local search. Random number generation is used to choose between local and global searches. The BOA approach is founded on a trade-off between emitting and sensing fragrance [14]. BOA is a very efficient algorithm with low complexity and a high rate of convergence.
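For orientation, the following is a minimal sketch of the standard continuous form of BOA, not the redesigned discrete operators of this paper. It assumes the commonly used update rules: fragrance f = c·Iᵃ, a global move toward the best butterfly, a local move relative to two random butterflies, and a switch probability p. The parameter values, the intensity mapping for minimization, the greedy acceptance, and the sphere test function are all assumptions.

```python
import random


def sphere(x):
    """Toy objective (assumption): sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def boa_minimize(objective, dim, rng, n_butterflies=20, iterations=200,
                 c=0.01, a=0.1, p=0.8, bounds=(-5.0, 5.0)):
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_butterflies)]
    fit = [objective(x) for x in pop]
    b = min(range(n_butterflies), key=fit.__getitem__)
    best, best_fit = list(pop[b]), fit[b]
    for _ in range(iterations):
        for i in range(n_butterflies):
            intensity = 1.0 / (1.0 + fit[i])       # assumes objective >= 0
            fragrance = c * intensity ** a
            r = rng.random()
            if rng.random() < p:                   # global search toward the best
                step = [(r * r * g - x) * fragrance for g, x in zip(best, pop[i])]
            else:                                  # local search among neighbours
                j, k = rng.sample(range(n_butterflies), 2)
                step = [(r * r * xj - xk) * fragrance
                        for xj, xk in zip(pop[j], pop[k])]
            cand = [min(hi, max(lo, x + s)) for x, s in zip(pop[i], step)]
            cand_fit = objective(cand)
            if cand_fit < fit[i]:                  # greedy acceptance (assumption)
                pop[i], fit[i] = cand, cand_fit
                if cand_fit < best_fit:
                    best, best_fit = list(cand), cand_fit
    return best, best_fit


best, val = boa_minimize(sphere, 3, random.Random(1))
print(val)
```

The switch probability p controls how often a butterfly follows the global best rather than exploring locally.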

3.1 Objective Function for RSSP

The secondary structure of RNA is described by a list of base pairs formed from its primary sequence. Let S = s1, s2, …, sn represent an RNA sequence, where S is a string over the alphabet {a, u, g, c}. A pair (p, q) is termed a canonical (complementary) base pair if {p, q} equals {a, u} or {g, c}; the wobble pair {g, u} is also allowed. Pairs such as {a, g} and {c, u} are not recognized as base pairs. The most stable and common base pairs are therefore {g, c}, {a, u}, and {g, u}, along with their counterparts {c, g}, {u, a}, and {u, g}. Once these pairs are formed, the RNA strand folds back upon itself, giving rise to its secondary structure.
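The pairing rule above can be checked with a few lines of code. The sketch below is illustrative only: the function names are ours, and the minimum separation of three unpaired nucleotides between paired positions is an assumption that the text does not state.

```python
CANONICAL = {frozenset("au"), frozenset("gc")}   # Watson-Crick pairs
WOBBLE = {frozenset("gu")}                       # G-U wobble pair


def is_base_pair(p, q, allow_wobble=True):
    """True if nucleotides p and q can pair; order does not matter."""
    pair = frozenset((p.lower(), q.lower()))
    return pair in CANONICAL or (allow_wobble and pair in WOBBLE)


def candidate_pairs(seq, min_loop=3):
    """All index pairs (i, j), i < j, that may pair with at least min_loop bases between."""
    return [(i, j)
            for i in range(len(seq))
            for j in range(i + min_loop + 1, len(seq))
            if is_base_pair(seq[i], seq[j])]


print(is_base_pair("g", "c"), is_base_pair("a", "g"))
print(candidate_pairs("gaaac"))
```

Representing each pair as a `frozenset` makes {g, c} and {c, g} compare equal, which mirrors the counterpart pairs listed in the text.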

Our primary objective is to maximize the number of stems in order to construct an RNA secondary structure from a given sequence and select the most stable secondary structure. The secondary structure can be determined for an individual sequence using thermodynamic principles. These thermodynamic methods predict the stability of a structure and rely on nearest neighbor rules. The stability of a structure can be quantified by calculating the minimum free energy. The structure with the lowest free energy is considered the most stable.

In this method, we have established an objective function, presented in Eq. (1), based on the individual nearest-neighbor hydrogen bond (INN-HB) model, which is one of the thermodynamic models [11]. The free energy of each helix is calculated using Eq. (2).

$$F=\min\left\{\Delta G_{k}\right\};\quad 1\le k\le n;\quad n=\text{the number of secondary structures in one sequence} \tag{1}$$

$$\Delta G_{37}^\circ = \Delta G_{37\,\text{init}}^\circ + \sum \Delta G_{37\,\text{NN}}^\circ + \Delta G_{37\,\text{AU/GU end}}^\circ\,(\text{per AU/GU end}) + \Delta G_{37\,\text{sym}}^\circ \tag{2}$$
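The two equations can be read as a simple computation: Eq. (2) sums an initiation term, the nearest-neighbor stacking terms, a penalty per terminal AU or GU pair, and a symmetry term, and Eq. (1) picks the candidate with the lowest total. In the sketch below the function names are ours, the symmetry term is assumed to apply only to self-complementary helices, and every numeric value is a placeholder chosen for illustration rather than a published INN-HB parameter.

```python
def helix_free_energy(stack_energies, n_terminal_au_gu, self_complementary,
                      dg_init, dg_end_penalty, dg_sym):
    """Eq. (2): initiation + sum of NN stacks + end penalties + symmetry term."""
    dg = dg_init + sum(stack_energies) + n_terminal_au_gu * dg_end_penalty
    if self_complementary:
        dg += dg_sym
    return dg


def most_stable(free_energies):
    """Eq. (1): index and value of the minimum free energy among candidates."""
    k = min(range(len(free_energies)), key=free_energies.__getitem__)
    return k, free_energies[k]


# Placeholder values in kcal/mol (NOT published parameters).
dg = helix_free_energy(stack_energies=[-2.0, -3.0, -1.0],
                       n_terminal_au_gu=1,
                       self_complementary=False,
                       dg_init=4.0, dg_end_penalty=0.5, dg_sym=0.4)
print(dg)
print(most_stable([dg, -7.25, -3.0]))
```

In practice the stacking energies would be looked up from a nearest-neighbor parameter table for each adjacent pair of base pairs in the helix.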

The alternative method for RNA structure prediction that we have used, known as the maximum expected accuracy structure, is determined by the maximum sum of pairing probabilities. It is suitable for predicting the secondary structure of each individual sequence. The graphical representations of the secondary structure prediction problem are shown in Fig. 1 (Table 1).

Table 1 Symbol table for Eq. (2)

Sl. no.	Sequence	Accession number	RNA class	Length (nt)	Number of base pairs
1	G. stearothermophilus	AJ251080	5S rRNA	117	38
2	S. cerevisiae	X67579	5S rRNA	118	37
3	E. coli	V00336	5S rRNA	120	40
4	H. marismortui	AF034620	5S rRNA	122	38
5	T. aquaticus	X01590	5S rRNA	123	40
6	D. radiodurans	AE002087	5S rRNA	124	40
7	M. anisopliae (3)	AF197120	Group I intron, 23S RNA	394	120
8	C. saccharophila	AB058310	Group I intron, 23S RNA	454	126
9	M. anisopliae (2)	AF197122	Group I intron, 23S RNA	456	115
10	A. lagunensis	U40258	Group I intron, 23S RNA	468	113
11	H. rubra	L19345	Group I intron, 23S RNA	543	141
12	A. griffini	U02540	Group I intron, 23S RNA	556	131
13	P. leucosticta	AF342746	Group I intron, 23S RNA	605	121
14	C. elegans	X54252	16S RNA	697	189
15	D. virilis	X05914	16S RNA	784	233
16	A. cahirinus	X84387	16S RNA	940	260
17	X. laevis	M27605	16S RNA	945	251
18	H. sapiens	J01415	16S RNA	954	266
19	A. fulgens	Y08511	16S RNA	964	265
20	S. acidocaldarius	D14876	16S RNA	1492	458
