Introduction

miRNAs (microRNAs) are small, endogenous, noncoding RNA molecules of about 22 nucleotides1. Cumulative evidence from biological experiments shows that miRNAs play a fundamental role in various biological processes, most notably the post-transcriptional regulation of gene expression through binding to the 5′ untranslated regions (UTRs), coding sequences, or 3′ UTRs of target messenger RNAs (mRNAs)2,3. According to the latest release of the online miRNA database miRBase (v22), there are 38,589 entries representing hairpin precursor miRNAs that express 48,860 mature miRNAs across 271 organisms, including humans, mice, and rats4. The human genome alone contains 1917 annotated hairpin precursors and 2654 mature sequences4. It is estimated that in mammals, approximately one-third of all protein-coding genes are regulated by miRNAs5. Several studies show that the deregulation of miRNAs is associated with many types of human disease, e.g. cancer, cardiovascular diseases, and autoimmune diseases6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27. Because of these relationships between miRNAs and various diseases, studies of the functions, processes, and mechanisms of miRNAs are increasing dramatically28. Classifying miRNAs is therefore a critical problem in computational biology.

The first miRNA was discovered in Caenorhabditis elegans in 1993 through the studies of Ambros and Ruvkun, who found that lin-4 was a small noncoding RNA rather than a protein-coding RNA29,30,31. Seven years later, in 2000, the second miRNA, let-7, was reported. Experimental results showed that let-7 is a 21-nucleotide RNA that regulates the timing of the transition from the fourth larval stage (L4) to the adult stage in C. elegans development32.

The biogenesis of miRNAs involves several steps and cellular mechanisms (Fig. 1), some in the nucleus and some in the cytoplasm. Because these processes follow somewhat different pathways, pre-miRNAs can be divided into two categories: mirtrons and canonical miRNAs. Compared to mirtrons, canonical miRNAs are more conserved and easier to identify33. Biogenesis begins with the transcription of miRNA genes, which yields primary miRNA hairpins called pri-miRNAs34,35,36. In the canonical pathway, pre-miRNAs with the hairpin structure are formed in the nucleus by the microprocessor complex, consisting of Drosha and DGCR8, which cleaves the pri-miRNAs37,38. The resulting pre-miRNAs are then transported into the cytoplasm by exportin-5, where they are cleaved into small RNA duplexes by another RNase III enzyme, Dicer, finally producing the mature miRNA39,40. The mirtron pathway bypasses the nuclear enzyme Drosha by using splicing to produce short pre-miRNA hairpin introns41,42; the subsequent steps follow the same pathway as canonical miRNAs43. Mirtrons can be further divided into three categories according to their sequence and structure: canonical, 3′-tailed, and 5′-tailed42. Compared to canonical miRNAs, mirtron hairpins and their small RNAs have numerous distinguishing features44,45.

Figure 1

Overview of miRNA biogenesis.

In previous studies, numerous computational methods, such as decision trees (DT), random forests (RF), and support vector machines (SVM), have been widely applied to miRNA identification and classification, as in computational biology and healthcare more broadly46,47,48,49,50,51,52. Recently, deep learning (DL) methods have also been frequently used to achieve better prediction accuracy than traditional machine learning methods53,54,55,56,57,58,59. Convolutional neural networks (CNNs), a type of DL, have been successfully employed for pre-miRNA classification33,56. For instance, Zheng et al.33 proposed a nucleotide-level CNN model: they encoded the sequences using "one-hot" encoding and padded each entry to the same shape, and the model consisted of convolution and max-pooling layers. Their investigation showed that a CNN-based network is feasible for extracting features from biological sequences. CNN-based methods perform well at identifying miRNAs and can extract features automatically from the raw input data without detailed domain knowledge60,61,62,63. However, Park et al.64 showed that the spatial information of these structures is as important as the structures that make up miRNAs; they therefore focused on long-term dependencies and proposed an LSTM-based framework to identify precursor miRNAs. Moreover, much research reveals that CNN-LSTM networks make it possible to use structural characterization and spatial information together. A CNN-LSTM network combines CNN layers for feature extraction on input data with LSTM layers that support sequence prediction65. These networks are used in a variety of problems such as activity recognition, image description, video description, visual time series prediction, and generating textual annotations from image sequences65,66. Quang et al. proposed a hybrid CNN-LSTM framework67, DanQ, for predicting the function of DNA sequences; in this model, the convolution layer captures patterns and the recurrent layer captures long-term dependencies. Similarly, Pan et al. proposed iDeepS68 to identify binding sequence and structure patterns from RNA sequences. Their model extracts features using a CNN and captures possible long-term dependencies using a bi-directional LSTM (BLSTM). These successful studies show that utilizing both spatial and sequential features provides higher performance, especially in computational biology.

Many existing pre-miRNA classification methods focus on either the sequential or the spatial structure of pre-miRNAs. The main features that distinguish pre-miRNAs from each other are the types, number, and order of the nucleotides that make up their fundamental structure. Hence, a hybrid CNN-LSTM network offers a way to classify pre-miRNAs by exploiting both their spatial and sequential features.

Materials and methods

In this study, we present a hybrid deep learning method for pre-miRNA classification based on both the sequential and the spatial structure of pre-miRNAs, integrating two different networks: a CNN and an LSTM. We first describe the problem of pre-miRNA classification. Then, we introduce the dataset used to train and evaluate the proposed method, consisting of human mirtron and canonical miRNA sequences44; for consistency, the same sequence data were used as in the previous models33,46. The CNN extracts features from the input data automatically, which removes the need for manual feature extraction; the LSTM layer then performs temporal modeling on the output of the CNN layers that convolve the input. Next, we give comprehensive details about CNN, LSTM, and CNN-LSTM networks. Finally, we describe our proposed method and its implementation in detail. The method was implemented in Python using the Keras library (v2.4.3, https://github.com/keras-team/keras) with a TensorFlow (v2.4.0) backend.

Problem statement

Many existing pre-miRNA classification methods rely on manual feature extraction and focus on either the spatial or the sequential structure of pre-miRNAs. To overcome the limitations of previous models, we propose a nucleotide-level deep learning method based on a hybrid CNN-LSTM network for pre-miRNA classification. Considering the structure and sequence of pre-miRNAs, the task is a binary sequence classification problem with two classes: mirtrons and canonical miRNAs. In the literature, several models, including machine-learning methods, have been developed for this classification problem; however, they achieve only approximately 90% accuracy. In this study, our goal was to show how a hybrid CNN-LSTM network can predict the pre-miRNA classes more accurately.

Convolutional neural networks

A CNN is a type of deep neural network that produces excellent performance and has been widely applied to many applications such as image classification69,70, object detection71,72, speech recognition73, computer vision74, video analysis75, and bioinformatics76,77. Unlike traditional neural networks, a CNN includes numerous layers that make it deeper, and it combines weights, biases, and outputs via a nonlinear activation. A typical CNN architecture fundamentally consists of convolutional layers, pooling layers, and fully connected layers63.

The convolution operation used in the convolutional layer is as follows:

$$ F(i,j) = (I*K)(i,j) = \sum_{m}\sum_{n} I(i+m,\, j+n)\, K(m,n) $$
(1)

where I is the input matrix, K is a 2D filter of size m × n, and F is the output 2D feature map; the convolution operation is denoted by I*K.
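For illustration, the following is a direct NumPy transcription of Eq. (1) in "valid" mode. It is a teaching sketch only; deep learning libraries implement the same operation (which, as in most CNN literature, is strictly a cross-correlation) in optimized form.

```python
import numpy as np

def conv2d_valid(I: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Compute F(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n) ('valid' mode)."""
    m, n = K.shape
    H, W = I.shape
    F = np.zeros((H - m + 1, W - n + 1))
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            # Slide the m x n filter over the input and sum the products.
            F[i, j] = np.sum(I[i:i + m, j:j + n] * K)
    return F

# Example: a random one-hot-like matrix convolved with a random 6 x 4 filter.
rng = np.random.default_rng(0)
I = rng.integers(0, 2, size=(164, 4)).astype(float)
K = rng.standard_normal((6, 4))
print(conv2d_valid(I, K).shape)  # (159, 1)
```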

Long short-term memory networks

An LSTM network is a class of recurrent neural network (RNN) that uses memory blocks, which help it train successfully and learn faster than a traditional RNN78,79. LSTM networks offer practical solutions for the vanishing and exploding gradient problems of RNNs80. Unlike standard RNNs, an LSTM maintains a cell state that stores long-term information and is regulated by input, forget, and output gates. Thus, the network can remember previous data and connect it with the present input, and it can solve complicated tasks that earlier RNNs could not79,81.
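For reference, the standard LSTM update equations, with input x_t, hidden state h_t, and cell state c_t at time step t, are:

$$ f_t = \sigma\left( W_f \left[ h_{t-1}, x_t \right] + b_f \right), \quad i_t = \sigma\left( W_i \left[ h_{t-1}, x_t \right] + b_i \right), \quad o_t = \sigma\left( W_o \left[ h_{t-1}, x_t \right] + b_o \right) $$

$$ \tilde{c}_t = \tanh\left( W_c \left[ h_{t-1}, x_t \right] + b_c \right), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh\left( c_t \right) $$

where σ is the logistic sigmoid and ⊙ denotes element-wise multiplication; the forget gate f_t, input gate i_t, and output gate o_t regulate what the cell state discards, stores, and exposes, respectively.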

CNN and LSTM networks

A CNN-LSTM model combines CNN layers that extract features from the input data with LSTM layers that support sequence prediction65. CNN-LSTMs are commonly used for activity recognition, image captioning, and video captioning; what these applications share is that they involve visual time series prediction or generating textual annotations from image sequences65,66.

Figure 2 shows the basic architecture of the CNN-LSTM network, consisting of the input layer, visual feature extraction, sequence learning, and the output layer65.

Figure 2

The basic architecture of the CNN-LSTM network.

Training and test datasets

The dataset consists of mirtron and canonical miRNA data. In the preprocessing phase, we combined two datasets, giving 707 (63%) canonical miRNAs and 417 (37%) mirtrons. The first dataset (Dataset 1) consisted of mirtrons and canonical miRNAs derived from miRBase (v21) according to the annotation of Wen et al.44. The second dataset (Dataset 2), also derived from the study of Wen et al.44, included 201 putative mirtron entries. In total, we used 1124 entries in our proposed model. The same dataset was used in order to remain consistent with Zheng et al.33 and Rorbach et al.46.

Stratified k-fold cross-validation (CV) is a resampling procedure that splits the dataset into folds according to the output categories and ensures that each fold has the same class proportions, which is especially useful for imbalanced datasets82. Hence, we used stratified 5-fold CV for training and evaluating our model82. At each iteration, the data were divided into training and test sets with an 80-20% split, with a different fold serving as the test set in each iteration.
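As an illustration, the following is a minimal sketch of this protocol using scikit-learn's StratifiedKFold; the use of scikit-learn and the label coding (1 = mirtron, 0 = canonical miRNA) are our assumptions, not details stated in the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder arrays with the dataset's dimensions and class balance:
# X holds the one-hot encoded sequences, y the class labels.
X = np.zeros((1124, 164, 4), dtype=np.float32)
y = np.array([0] * 707 + [1] * 417)  # 707 canonical, 417 mirtrons

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Each fold preserves the ~63/37 canonical-to-mirtron class ratio.
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```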

Table 1 shows the distribution of the training and test datasets at each iteration in stratified 5-folds CV.

Table 1 Distribution of the training and test datasets in stratified 5-folds CV.

The preprocessing of the data

The longest sequence among the entries had 164 bases. Therefore, we padded each sequence to this maximum length (164), using the character "N" to keep all sequences the same length. Like Zheng et al.33, we used "one-hot" encoding to encode each base of the sequences (Table 2). Next, we converted each sequence into a vector of dimension (164, 4) in the vectorization process; a sketch of this step follows Table 2 below.

Table 2 “One-hot” encoding for the base sequence.
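The following is a minimal sketch of the padding and vectorization step. The exact base-to-vector assignment is the one given in Table 2; the mapping below and the encoding of the pad symbol "N" as an all-zero row are our assumptions for illustration.

```python
import numpy as np

# Assumed base-to-vector mapping (see Table 2 for the actual assignment).
BASE_TO_VEC = {
    'A': [1, 0, 0, 0],
    'C': [0, 1, 0, 0],
    'G': [0, 0, 1, 0],
    'U': [0, 0, 0, 1],
    'N': [0, 0, 0, 0],  # padding symbol (assumed all-zero)
}
MAX_LEN = 164  # length of the longest sequence in the dataset

def encode(seq: str) -> np.ndarray:
    """Pad a pre-miRNA sequence with 'N' to MAX_LEN and one-hot encode it."""
    padded = seq.upper().ljust(MAX_LEN, 'N')
    return np.array([BASE_TO_VEC[base] for base in padded], dtype=np.float32)

print(encode('UGAGGUAGUAGGUUGUAUAGUU').shape)  # (164, 4)
```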

The method architecture

We designed the architecture of our model with nine layers: an input layer, four CNN layers wrapped in time-distributed layers, an LSTM layer, a dense layer, a dropout layer, and an output layer. Figure 3 illustrates and visualizes this architecture. Before constructing the model, we ensured that each entry had been transformed into an appropriate form: we used padding to bring each miRNA sequence to the same length (164) by adding "N" for each blank position, and the subsequent vectorization step transformed the padded sequences into m × n matrices using one-hot encoding.

Figure 3

Detailed architecture with visualization of the proposed methodology.

Once all data were padded and vectorized, the network was ready for the feature extraction process. In this stage, three convolution layers were used to automatically extract features from the input sequences using the ReLU activation function. Each convolutional layer used 128 filters with a kernel height of 6 and a kernel width of 4, a kernel size that gives higher performance33. We wrapped the convolution layers in a time-distributed wrapper, which reshapes the input data by adding an extra dimension at the end. To concatenate all extracted features, we employed a flatten layer before passing them to the LSTM layer. We then designed one LSTM layer with 100 units, followed by a dropout layer (0.5) on the fully connected layer. Finally, for binary classification, the softmax activation function was used to produce the outputs. The model was trained for 30 epochs with a batch size of 6 and a validation split of 0.1. The validation dataset monitors convergence during training, so that training can be stopped early according to changes in this convergence. In addition, the Adam optimizer with a 0.001 learning rate and the categorical cross-entropy loss function were used during optimization. Adam is a gradient descent algorithm that computes adaptive, momentum-like learning rates for each parameter83, and categorical cross-entropy is a loss function preferred when there are two or more one-hot encoded label classes84; it is suited to multi-class classification models with a softmax activation function.
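The following is a minimal Keras sketch consistent with this description: three time-distributed Conv2D layers with 128 filters and 6 × 4 kernels, a time-distributed flatten layer, a 100-unit LSTM, dropout of 0.5, and a two-unit softmax output, compiled with Adam (learning rate 0.001) and categorical cross-entropy. The 4 × 41 subsequence split used to feed the time-distributed wrapper and details such as the padding mode are our assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, N_BASES = 164, 4  # padded length and one-hot alphabet size

model = models.Sequential([
    # Split each (164, 4) one-hot matrix into 4 time steps of 41 bases and
    # add a channel axis so the 2D convolutions can be time-distributed.
    layers.Reshape((4, 41, N_BASES, 1), input_shape=(SEQ_LEN, N_BASES)),
    layers.TimeDistributed(layers.Conv2D(128, (6, 4), padding='same', activation='relu')),
    layers.TimeDistributed(layers.Conv2D(128, (6, 4), padding='same', activation='relu')),
    layers.TimeDistributed(layers.Conv2D(128, (6, 4), padding='same', activation='relu')),
    layers.TimeDistributed(layers.Flatten()),  # concatenate features per step
    layers.LSTM(100),                          # sequence learning
    layers.Dropout(0.5),
    layers.Dense(2, activation='softmax'),     # mirtron vs. canonical miRNA
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])
# Training as described in the text:
# model.fit(X_train, y_train, epochs=30, batch_size=6, validation_split=0.1)
```

Calling model.summary() on such a model produces a layer-by-layer listing of output shapes and parameter counts analogous to Table 3.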

Table 3 summarizes the model, listing the input layer, convolution layers, flatten layer, LSTM layer, fully connected layer, softmax layer, and classification layer together with the shape and number of parameters of each.

Table 3 The method summary.

Method evaluation

In this study, our method was evaluated on the test dataset. We calculated five performance measurements: accuracy (Acc.), sensitivity (Sen.), specificity (Spe.), F1 score, and Matthews correlation coefficient (MCC). They evaluate predictive capability in terms of the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), via the following equations:

Accuracy indicates the overall correctness of prediction:

$$ Acc = \frac{TP + TN}{TP + FN + TN + FP} $$
(2)

Sensitivity, true positive rate, indicates the ratio of correctly classified actual positives:

$$ Sen = \frac{TP}{TP + FN} $$
(3)

Specificity, true negative rate, indicates the ratio of correctly classified actual negatives:

$$ Spe = \frac{TN}{TN + FP} $$
(4)

The F1 score is the harmonic mean of the model's precision and recall:

$$ F1\;Score = \frac{2TP}{2TP + FP + FN} $$
(5)

The Matthews correlation coefficient (MCC) measures the quality of a binary classifier:

$$ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{\left( TP + FP \right)\left( TP + FN \right)\left( TN + FP \right)\left( TN + FN \right)}}. $$
(6)
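For completeness, a direct Python transcription of Eqs. (2)-(6) is shown below; the example counts are illustrative only, not results from the paper.

```python
import math

def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Eqs. (2)-(6) from the confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)             # Eq. (2)
    sen = tp / (tp + fn)                              # Eq. (3), true positive rate
    spe = tn / (tn + fp)                              # Eq. (4), true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)                  # Eq. (5)
    mcc = (tp * tn - fp * fn) / math.sqrt(            # Eq. (6)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {'Acc': acc, 'Sen': sen, 'Spe': spe, 'F1': f1, 'MCC': mcc}

print(metrics(tp=80, tn=130, fp=8, fn=7))  # illustrative counts
```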

Results and discussion

Because CNN and LSTM networks can extract features from pre-miRNA sequences automatically, without comprehensive domain expertise, we designed a hybrid method for the classification of pre-miRNAs. We started by preparing the dataset: the raw sequences were padded and converted to vectors using "one-hot" encoding. Then, we used three convolution layers to extract features automatically from the input sequences using the ReLU activation function. To concatenate all extracted features, we employed a flatten layer before passing them to the LSTM layer. We then designed one LSTM layer with 100 units, followed by a dropout layer (0.5) on the fully connected layer. Finally, for binary classification, the softmax activation function was used to produce the outputs. Table 4 shows the performance of our proposed network at each iteration. Additionally, we calculated the mean, median, standard deviation, and confidence interval (CI) for each metric.

Table 4 Performance of the proposed CNN-LSTM network for each fold.

Table 5 compares the average values of the proposed method with those of previous methods. The prediction resulted in 0.943 (95% CI ± 0.014) accuracy, 0.935 (95% CI ± 0.016) sensitivity, 0.948 (95% CI ± 0.029) specificity, 0.925 (95% CI ± 0.016) F1 score, and 0.880 (95% CI ± 0.028) MCC (Table 4). Compared to the closest results, our network achieved the best results for Acc., F1 score, and MCC, which were 2.51%, 1.00%, and 2.43% higher than the closest results, respectively. The mean sensitivity matched the highest value, obtained by linear discriminant analysis, and ranked first. These ratios indicate that hybrid CNN-LSTM networks can achieve better performance for pre-miRNA classification than previous methods. Even though our model performs better in terms of accuracy, sensitivity, F1 score, and MCC, it has a lower ratio (94.8%) of correctly classified true negatives. In imbalanced or skewed datasets, the number of examples of the minority class may be insufficient for learning; as a result, the minority class is misclassified more often than the majority class85,86. The numbers of positive and negative samples in our training and test datasets are equally representative of the entire dataset; thus, we addressed this misclassification problem at the data preparation level.

Table 5 Performance comparison of pre-miRNA classification.

This study is an investigation of the pre-miRNA classification problem through a convolutional neural network and a long short-term memory network. In contrast to other methods, we took into account both the sequence structure and the spatial information of each entry. The preprocessing of the data is the first and most important stage of our study: inappropriate preparation of the data would cause the network to be trained incorrectly and would make it difficult to obtain reliable results. Thus, we checked all outputs after the encoding, padding, and vectorization processes. In addition, cascading the different neural networks is another issue in model construction; an inappropriate network design may increase the bias and cause unexpected results. Therefore, we ensured that all the layers were cascaded correctly.

Hyper-parameters determine the general characteristics of deep neural networks: the number of hidden units, the order of the layers, the batch size, the choice of optimizer, the learning rate, and so on directly affect performance. In this study, we drew on previous researchers' experiments in addition to our own. For instance, Zheng et al.33 found that a kernel size of 6 × 4 and 128 filters in the CNN produced the best results among the sizes and numbers they tested for pre-miRNA classification. When we tested hyper-parameters as Zheng et al.33 did, we obtained similar performance. In future work, we will build on the experience gained in these studies and perform more extensive hyper-parameter optimization to improve performance further.

Despite the promising performance of our model, there are still some limitations. The first comes from the total number of entries (1124) in the datasets: even though the datasets contain well-defined data, feeding the method with more training and testing data is important for obtaining more reliable results. The second limitation is the unbalanced class ratio. In this study, the number of positive samples (417) was smaller than the number of negative samples (707), a ratio of approximately 1:1.7. This imbalance may limit accuracy and the other metrics. Thus, we will focus on more comprehensive datasets in future research.

We consider that the quality and size of the related dataset are important for training a model and achieving robust classification predictions. In future studies, enhanced datasets may lead to more successful models in terms of similar evaluation parameters.

Conclusion

In this paper, we proposed a nucleotide-level hybrid deep learning method that combines a convolutional neural network with a long short-term memory network. In the data preprocessing phase, we used padding and one-hot encoding to convert each sequence into a matrix of the same size. Then, we employed three convolution layers wrapped in a time-distributed layer. To concatenate all extracted features, we employed a flatten layer before passing them to the LSTM layer. We then designed one LSTM layer followed by a dropout layer on the fully connected layer. Finally, for binary classification, the softmax activation function was used to produce the outputs. Our results showed that the proposed method was successfully trained on the training dataset and performed better on the test dataset than the previous models.

The results indicated that hybrid CNN-LSTM networks can achieve better performance for pre-miRNA classification. In future work, we will investigate new classification models that deliver better performance in terms of all the evaluation metrics.