Background

N6-methyladenosine (m6A) refers to methylation at the nitrogen-6 position of adenosine [1] and accounts for about 80% of eukaryotic mRNA methylation modifications [2, 3]. It was first discovered in the 1970s [4] and has since been found in many species, including animals, plants, bacteria, and viruses [5]. Known m6A sites fall within sequences conforming to the degenerate consensus motif RRACH (where A denotes the methylated adenosine) [6, 7]. Studies have found that m6A plays a crucial role in various biological processes and in ontogeny, including mRNA transcription, translation, nuclear export, splicing, and degradation [8], as well as early development, sex determination, T cell homeostasis, antiviral immunity, brain development, biological rhythms, spermatogenesis, and the directed differentiation of hematopoietic stem cells [9,10,11,12,13]. In addition, m6A methylation has been found to play a key role in the occurrence of diseases such as glioma, leukemia, and hepatocellular carcinoma [14,15,16]. It is therefore of great significance to unveil the mechanism of m6A methylation, for which the specific modification sites must first be identified accurately.

At present, high-throughput sequencing technologies are widely used in the study of m6A modification, among which MeRIP-Seq is the most common [17]. In MeRIP-Seq, the RNA is randomly broken into fragments (reads) before immunoprecipitation; reads from the immunoprecipitated sample are expected to map to regions containing an m6A site near their centers, and they cluster on mRNAs as distinct peaks [18, 19]. Such experiment-based high-throughput sequencing methods can perform sample-specific m6A site detection [20]. However, MeRIP-Seq is relatively complicated, costly, and time-consuming, which limits its extensive use. Computational methods that can predict m6A modification sites are therefore urgently needed.

Most conventional machine learning methods for sequence-based m6A site prediction first extract features and then train classifiers to predict whether a site is methylated based on those features. For example, iRNA-Methyl extracts features based on pseudo dinucleotide composition and three RNA physicochemical properties, and uses an SVM to construct a site prediction model [21]. SRAMP extracts features with three encoding methods, namely positional binary encoding of the nucleotide sequence, K-nearest neighbor (KNN) encoding, and nucleotide pair spectrum encoding, trains a random forest classifier on each, and combines the prediction scores of the random forests through a weighted sum [22]. AthMethPre extracts features from the positional flanking nucleotide sequence and the position-independent k-mer nucleotide spectrum, then uses an SVM classifier to predict m6A methylation sites [23]. The WHISTLE method integrates 35 additional genomic features beyond the conventional sequence features and then trains an SVM classifier to predict m6A sites [24]; the genomic features greatly improved its prediction performance. However, genomic features are not always available when only RNA sequences are given for m6A site identification. These results show that both the extraction of RNA sequence features and the design of the classifier affect the prediction performance for m6A modification sites.

The methods mentioned above all establish a closed feature extraction model that is independent of the downstream classifiers. Feature extraction is the key issue for most machine learning tasks, and its quality greatly affects the performance of the final site prediction. In contrast, deep learning models often follow an end-to-end design: from raw data to final output, features are extracted based on both the input data and the final prediction task. Moreover, since an RNA sequence carries abundant semantic information, much like a text sequence, it is natural to expect that sequence representation methods developed in the field of Natural Language Processing (NLP) may apply to RNA sequences. Specifically, Gene2vec uses Word2vec [25] and a Convolutional Neural Network (CNN) to predict m6A sites [26]. DeepPromise uses ENAC, One-hot [27], and RNA embedding [28] to encode RNA sequences, and then integrates the scores of CNN models to predict m6A sites [29]. By integrating a BGRU with word embedding and a Random Forest classifier with a novel enhanced nucleic acid content (ENAC) encoding, BERMP better identifies m6A sites, demonstrating that deep learning frameworks are well suited to prediction tasks with larger datasets [30]. Nevertheless, the prediction performance of existing methods can still be improved. This paper therefore proposes EDLm6APred, an ensemble deep learning m6A site predictor based on a recurrent neural network framework. It uses three encoding methods, One-hot, RNA word embedding, and Word2vec, to represent RNA sequences. On the vectorized sequence representations obtained from these encodings, bi-directional long short-term memory (BiLSTM) networks are constructed to perform feature extraction and site prediction simultaneously.
Finally, m6A modification sites are predicted by weighted integration of the three prediction scores produced by the BiLSTM models trained with the three feature encodings. Fivefold cross-validation experiments were conducted on three independent test sets, and metrics including the area under the ROC curve (AUROC), accuracy (ACC), precision, recall, and Matthews correlation coefficient (MCC) were calculated to compare EDLm6APred with state-of-the-art methods such as Gene2vec and DeepPromise.

Results

Performance evaluation

In this paper, we adopted widely used evaluation metrics to assess the performance of EDLm6APred, including the Area Under the Receiver Operating Characteristic curve (AUROC), Precision, Recall, Accuracy (ACC), and the Matthews correlation coefficient (MCC). These are the most widely used metrics for evaluating binary classifiers, and the definitions of Precision, Recall, ACC, and MCC are given in Eqs. (1)–(4) [31, 32].

$$Precision = \frac{TP}{TP + FP}$$
(1)
$$Recall = \frac{TP}{TP + FN}$$
(2)
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
(3)
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
(4)

where TP (true positives) is the number of positive samples correctly predicted as positive, TN (true negatives) is the number of correctly classified negative samples, FP (false positives) is the number of negative samples incorrectly classified as positive, and FN (false negatives) is the number of positive samples incorrectly classified as negative.
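For concreteness, the four metrics can be computed directly from the confusion-matrix counts. The following minimal Python sketch mirrors Eqs. (1)–(4); the function name and the toy counts are ours, chosen only for illustration.

```python
import math

def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Precision, Recall, ACC and MCC from confusion-matrix
    counts, following Eqs. (1)-(4)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Precision": precision, "Recall": recall, "ACC": acc, "MCC": mcc}

# Toy example: 80 TP, 70 TN, 20 FP, 30 FN
print(confusion_metrics(80, 70, 20, 30))
```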

Results analysis

In this paper, we first evaluated the effects of different sequence pre-processing methods, different sequence representation methods, and commonly used deep learning models on the prediction results. We then evaluated the performance of the EDLm6APred predictor. Finally, we compared our method with state-of-the-art m6A site predictors.

First, we tested three different sequence pre-processing methods on the human data set to compare their impact on model performance: overlapping equal length, overlapping variable length, and non-overlapping equal length. For example, given the sequence:

$${\text{AGGTCAGCATGC}}$$

In overlapping equal-length processing, a sliding window of size 3 nt moves along the sequence with a stride of one, yielding a series of sub-sequences of 3 bases each. The result for the hypothetical sequence above is:

$${\text{AGG}}\;{\text{GGT}}\;{\text{GTC}}\;{\text{TCA}}\;{\text{CAG}}\;{\text{AGC}}\;{\text{GCA}}\;{\text{CAT}}\;{\text{ATG}}\;{\text{TGC}}$$

In overlapping variable-length processing, each window's size K is sampled from the discrete uniform distribution Uniform(Klow, Khigh). In this paper, we set Klow = 3 and Khigh = 5. The result for the hypothetical sequence above is:

$${\text{AGG}}\;{\text{GGTC}}\;{\text{GTC}}\;{\text{TCAGC}}\;{\text{CAG}}\;{\text{AGCA}}\;{\text{GCAT}}\;{\text{CATGC}}$$

In non-overlapping equal-length processing, a sliding window of size 3 nt moves along the sequence with a stride of three, again yielding sub-sequences of 3 bases each. The result for the hypothetical sequence above is:

$${\text{AGG}}\;{\text{TCA}}\;{\text{GCA}}\;{\text{TGC}}$$
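The three pre-processing schemes can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the authors' code: in particular, the stopping rule for the variable-length scheme is not spelled out in the text, so the sketch infers one consistent with the worked example (a stride of one, stopping once a window reaches the end of the sequence).

```python
import random

def overlap_equal(seq: str, k: int = 3) -> list:
    """Overlapping equal length: window of size k, stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def overlap_variable(seq: str, k_low: int = 3, k_high: int = 5, seed=None) -> list:
    """Overlapping variable length: window size K ~ Uniform(k_low, k_high),
    stride 1; we stop once a window reaches the end of the sequence
    (our inferred stopping rule)."""
    rng = random.Random(seed)
    words, i = [], 0
    while i + k_low <= len(seq):
        k = rng.randint(k_low, k_high)
        end = min(i + k, len(seq))
        words.append(seq[i:end])
        if end == len(seq):
            break
        i += 1
    return words

def non_overlap_equal(seq: str, k: int = 3) -> list:
    """Non-overlapping equal length: window of size k, stride k."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

seq = "AGGTCAGCATGC"
print(overlap_equal(seq))      # ['AGG', 'GGT', 'GTC', ..., 'TGC']
print(overlap_variable(seq))   # e.g. ['AGG', 'GGTC', 'GTC', ...]
print(non_overlap_equal(seq))  # ['AGG', 'TCA', 'GCA', 'TGC']
```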

After pre-processing, the sub-sequences produced by each of the three methods were fed into the Word2vec-based predictor for site identification. The ROC curves are shown in Fig. 1, which indicates that the overlapping equal-length method outperforms the other two. Therefore, the overlapping equal-length method was used for sequence pre-processing in all subsequent experiments.

Fig. 1

ROC curves of different sequence pre-processing methods on the human independent test set. The sequence pre-processing methods are overlapping equal length, overlapping variable length, and non-overlapping equal length respectively

Next, the BiLSTM model was compared with LSTM and CNN models. This group of experiments adopted Word2vec to represent sequences; the models are denoted CNNWord2vec, LSTMWord2vec, and BiLSTMWord2vec respectively. The evaluation results and ROC curves of fivefold cross-validation on the human data set are shown in Table 1 and Fig. 2. Ranked by AUROC from high to low, the models are BiLSTMWord2vec, LSTMWord2vec, and CNNWord2vec: LSTMWord2vec is nearly 2% higher than CNNWord2vec, and BiLSTMWord2vec is nearly 1% higher than LSTMWord2vec. The reason may be that CNNs essentially extract local features of the sequence while ignoring the context, whereas the RNN-based LSTM and BiLSTM models can better capture interactions between distant elements in the sequence and the relative positions of the sub-sequences, and can thus extract global sequence features. Moreover, the pooling layer after a CNN may lose important positional information. BiLSTM performs better than LSTM, possibly because BiLSTM is composed of a forward LSTM and a backward LSTM and can capture upstream and downstream context simultaneously, while a unidirectional LSTM captures only one direction.

Table 1 Evaluation results of different deep learning models
Fig. 2

ROC curves of different deep learning models on the human independent test set. The deep learning models are CNN, LSTM, and BiLSTM respectively

Next, the prediction performance of the three feature encoding methods was compared. In this group of experiments, the sequences were first encoded by One-hot, RNA word embedding, and Word2vec respectively, and the same BiLSTM classifier framework was then used for site identification; all three predictors followed the same training procedure. A fivefold cross-validation experiment was carried out on the human data set. The ROC curves on the independent test set are shown in Fig. 3: the Word2vec-based model achieves an AUROC of 0.8510, higher than the RNA word embedding and One-hot based models.

Fig. 3

ROC curves of different sequence encoding modes on the human independent test set. The sequence encoding modes are One-hot, RNA word embedding, and Word2vec respectively

These three encoding methods represent sequences from different perspectives. In this paper, the deep prediction model EDLm6APred was constructed as a weighted integration of the three predictors. Fivefold cross-validation experiments were conducted on the human data set, the mouse data set, and a mixed human-mouse data set. The results on the independent test sets are shown in Table 2: EDLm6APred outperforms every single predictor. Integrating the three predictors considers both the positional information of the sequence and its context, so their advantages complement each other.

Table 2 Evaluation results of single predictor and integrated predictor based on different species

We also compared EDLm6APred with DeepPromise. We replaced the CNN model in DeepPromise with BiLSTM to construct BiLSTMDeepPromise, and then replaced the ENAC encoding in BiLSTMDeepPromise with Word2vec to construct our EDLm6APred predictor. Fivefold cross-validation experiments were conducted on the human data set, the mouse data set, and the mixed human-mouse data set. The results are shown in Table 3.

Table 3 Comparison with DeepPromise predictors

The AUROC of EDLm6APred is significantly better than that of DeepPromise and BiLSTMDeepPromise. This is likely because ENAC encoding considers only the nucleic acid composition and positional information of the sequence and fails to capture its deeper semantic information, whereas Word2vec represents the sequence more fully. In addition, BiLSTM is better suited than CNN to capturing the features of an RNA sequence.

Discussion

In this paper, the m6A site predictor EDLm6APred was constructed based on word embedding algorithms and a bi-directional long short-term memory recurrent neural network, and various RNA sequence pre-processing and feature encoding methods were explored. We compared three data pre-processing methods: overlapping equal length, overlapping variable length, and non-overlapping equal length; the overlapping equal-length method was selected for pre-processing the RNA sequences. We then obtained feature representations of the sequences through three encoding methods: One-hot, RNA word embedding, and Word2vec. Moreover, we compared the effects of three deep learning models, CNN, LSTM, and BiLSTM, on site prediction performance. The experimental results showed that the BiLSTM model significantly improves prediction performance. Considering that the different encoding approaches depict the sequence from different, potentially complementary perspectives, EDLm6APred combines the BiLSTM predictors built on these encodings through weighted integration to obtain the final prediction.

Conclusions

The contribution of this paper is the proposal of the m6A site predictor EDLm6APred under a deep recurrent neural network framework. Different RNA sequence feature encoding methods were employed to decipher RNA sequences more thoroughly, and the BiLSTM model was employed to better exploit contextual information for m6A site prediction.

Methods

Data and its sequence representation

This paper is based on the human and mouse data sets established by Zou et al. Both data sets obtained complementary DNA (cDNA) sequence data from the Ensembl database [33]. After the mRNA sequences were obtained through reverse complementation, sequences without a GAC or AAC motif at the center were removed, and sequences shorter than 1001 nt were padded with the character "X". The sequences used for algorithm training are thus all 1001 nt long, with a 1:1 ratio of positive to negative samples. See the web server www.xjtlu.edu.cn/biologicalsciences/EDLm6APred for details of the data sets.

An effective feature encoding method is decisive for the performance of the site prediction model. The sequences are encoded by One-hot, RNA word embedding, and Word2vec respectively. One-hot and RNA word embedding are standard approaches for RNA sequence encoding, yielding high-dimensional sparse binary vectors and low-dimensional dense word vectors respectively to characterize RNA modification sites. Word2vec can extract relevant semantic features from the upstream and downstream context of each base and translate them into word vectors.

For One-hot encoding, each single base in the RNA sequence is represented by a 5-dimensional binary vector, corresponding to the four nucleotides and the padding character "X": A = [1, 0, 0, 0, 0], T = [0, 1, 0, 0, 0], G = [0, 0, 1, 0, 0], C = [0, 0, 0, 1, 0], and X = [0, 0, 0, 0, 1]. Each 1001-nt sequence in the data set is therefore converted into a 5005-bit binary vector.
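A minimal Python sketch of this encoding, assuming NumPy; the mapping follows the vectors given above, and the helper name is ours.

```python
import numpy as np

# Fixed base-to-channel mapping; "X" is the padding character
ONE_HOT = {
    "A": [1, 0, 0, 0, 0],
    "T": [0, 1, 0, 0, 0],
    "G": [0, 0, 1, 0, 0],
    "C": [0, 0, 0, 1, 0],
    "X": [0, 0, 0, 0, 1],
}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a sequence as an L x 5 binary matrix; for a 1001-nt
    sequence this is 5005 bits when flattened."""
    return np.array([ONE_HOT[base] for base in seq], dtype=np.int8)

mat = one_hot_encode("AGGTCAGCATGCX")
print(mat.shape)   # (13, 5) for this toy example
```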

For RNA word embedding, a 3-nt window slides over each sequence to obtain 999 sub-sequences of 3 bases each. In total, 105 distinct sub-sequences occur, and each is assigned a unique integer index in a dictionary. Each integer index represents a pseudo-RNA word, so every pre-processed sequence is converted into a sequence of integer indexes and fed into the embedding layer. A 1001-nt sequence in the data set is thereby converted into a 999 × 100 matrix, where 100 is the dimension of the word vectors.
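The index-dictionary step can be sketched as follows. The 1-based indexing (reserving 0 for padding) is our assumption, reflecting a common embedding-layer convention rather than a detail stated in the paper.

```python
def overlap_equal(seq: str, k: int = 3) -> list:
    # 3-mers with stride 1, as in the overlapping equal-length scheme
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(tokenized_seqs) -> dict:
    """Assign each distinct 3-mer a unique integer index. Indexing is
    1-based here, reserving 0 for padding (assumed convention)."""
    vocab = {}
    for words in tokenized_seqs:
        for w in words:
            vocab.setdefault(w, len(vocab) + 1)
    return vocab

corpus = [overlap_equal("AGGTCAGCATGC")]   # toy corpus of one sequence
vocab = build_vocab(corpus)
indexed = [[vocab[w] for w in words] for words in corpus]
print(indexed[0])   # all ten 3-mers are distinct here: [1, 2, ..., 10]
```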

Word2vec encoding can follow either the CBOW or the Skip-gram model. The CBOW model predicts the current word from its context, while the Skip-gram model predicts the context from the current word. CBOW is known to train faster than Skip-gram. Moreover, our data set is relatively large and the vocabulary is small (105 word types), with no rare or low-frequency words. The CBOW model is therefore used to encode RNA sequences in this paper. Specifically, the sequences are first divided into 3-nt sub-sequences by overlapping equal-length processing, and the CBOW model is then trained on them. Each sub-sequence is thus mapped to a semantic word vector, and the resulting word vectors represent each 1001-nt sequence in the data set as a 999 × 100 matrix. The inputs and outputs of the three encoding methods are summarized in Table 4.
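A training sketch using the gensim library's Word2Vec implementation is shown below; sg=0 selects CBOW, and vector_size=100 matches the 100-dimensional word vectors used here, while the window size and min_count are our assumptions, as the paper does not report them.

```python
import numpy as np
from gensim.models import Word2Vec

def tokenize(seq: str, k: int = 3) -> list:
    # Overlapping equal-length 3-mers, stride 1
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy corpus; in practice, all pre-processed training sequences
corpus = [tokenize(s) for s in ["AGGTCAGCATGC", "TCAGCATGCAGG"]]

# sg=0 selects CBOW; window and min_count are illustrative assumptions
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=0)

# A 1001-nt sequence tokenizes into 999 3-mers, giving a 999 x 100 matrix
matrix = np.stack([model.wv[w] for w in corpus[0]])
print(matrix.shape)   # (10, 100) for this toy sequence
```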

Table 4 Input and output formats of the three feature encodings

BiLSTM

BiLSTM is developed from the RNN (Recurrent Neural Network) and consists of two parts, a forward LSTM (Long Short-Term Memory) layer and a backward LSTM layer [34, 35]; its structure is shown in Fig. 4. The forward LSTM layer computes from time step 1 to time step t, saving the forward hidden layer's output at each step. Simultaneously, the backward LSTM layer computes from time step t back to time step 1, saving the backward hidden layer's output at each step. The final output at each time step is obtained by combining the outputs of the forward and backward LSTM layers at the corresponding step.

Fig. 4

BiLSTM model diagram, which consists of the forward LSTM layer and the backward LSTM layer

The basic LSTM structure employs a set of memory cells to learn when to forget historical information and when to update it, as shown in Fig. 5. At time step t, the memory cell Ct records all historical information up to the current step and is controlled by three "gates": the forget gate ft, the input gate it, and the output gate ot. The forget gate ft determines what information to discard from the cell state, as shown in (5). It takes ht−1 (the previous hidden state) and xt (the current input) and outputs a number between 0 and 1 for each entry of the previous state Ct−1, where 1 means fully retained and 0 means completely deleted.

$$f_{t} = \sigma (W_{f} \cdot [h_{t - 1} ,x_{t} ] + b_{f} )$$
(5)
Fig. 5

LSTM model diagram. The cell is controlled by three "gates": the forget gate ft, the input gate it, and the output gate ot. A set of memory cells learns when to forget historical information and when to update it

The input gate determines what new information is stored in the cell state. First, the input gate's Sigmoid activation function determines which values to update, as shown in (6).

$$i_{t} = \sigma (W_{i} \cdot [h_{t - 1} ,x_{t} ] + b_{i} )$$
(6)

Then, a tanh activation function creates a candidate vector \(\tilde{C}_{t}\) (new information) to be added to the cell state, as shown in (7). The two results are then combined to form the update.

$$\tilde{C}_{t} = \tanh (W_{c} \cdot [h_{t - 1} ,x_{t} ] + b_{c} )$$
(7)

The previous state value Ct−1 is then updated to Ct: Ct−1 is multiplied by ft to forget what was selected for discarding, and \(i_{t} * \tilde{C}_{t}\) is added to obtain the new state value Ct, as shown in (8).

$$C_{t} = f_{t} * C_{t - 1} + i_{t} * \tilde{C}_{t}$$
(8)

The output gate determines what to output based on the current cell state. First, a Sigmoid activation function determines which parts of the cell state to output, as shown in (9).

$$o_{t} = \sigma (W_{o} \cdot [h_{t - 1} ,x_{t} ] + b_{o} )$$
(9)

Then, the cell state is passed through tanh to normalize its values to between −1 and 1 and multiplied by the output gate's activation, so that only the parts selected by the output gate are emitted, as shown in (10).

$$h_{t} = o_{t} * {\text{tanh}}(C_{t} )$$
(10)

where xt is the input at time step t; ft, it, and ot are the states of the forget gate, input gate, and output gate respectively; and Wf, Wi, Wc, Wo and bf, bi, bc, bo are the corresponding weight matrices and bias vectors.

Owing to these design characteristics, LSTM has shown great advantages in modeling time series data and can effectively alleviate the long-term dependency and vanishing gradient problems of standard recurrent neural networks [36]. By combining forward and backward LSTMs, BiLSTM not only mitigates vanishing and exploding gradients but also fully considers the context of the current base fragment [37].
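To make the gating equations concrete, the following NumPy sketch implements one LSTM time step following Eqs. (5)–(10); a BiLSTM runs one such recurrence forward and one backward over the sequence and combines the two hidden outputs at each step. The dimensions and initialization are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step following Eqs. (5)-(10). Each W_* has shape
    (hidden, hidden + input); each b_* has shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # Eq. (5): forget gate
    i_t = sigmoid(W_i @ z + b_i)           # Eq. (6): input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # Eq. (7): candidate state
    c_t = f_t * c_prev + i_t * c_tilde     # Eq. (8): cell-state update
    o_t = sigmoid(W_o @ z + b_o)           # Eq. (9): output gate
    h_t = o_t * np.tanh(c_t)               # Eq. (10): hidden output
    return h_t, c_t

# Toy dimensions: 4-dim input, 3-dim hidden state
rng = np.random.default_rng(0)
hidden, inp = 3, 4
Ws = [rng.standard_normal((hidden, hidden + inp)) * 0.1 for _ in range(4)]
bs = [np.zeros(hidden) for _ in range(4)]
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(inp), h, c, *Ws, *bs)
print(h.shape, c.shape)   # (3,) (3,)
```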

m6A site prediction based on BiLSTM

Three m6A site predictors are constructed by combining BiLSTM with the three sequence feature encoding methods: One-hot, RNA word embedding, and Word2vec. Taking Word2vec as an example, the predictor adopts a five-part architecture comprising the input layer, BiLSTM layer, flattening layer, fully connected layers, and prediction layer, where the input layer handles data pre-processing, as shown in Fig. 6.

Fig. 6

Predictor based on Word2vec and the BiLSTM recurrent neural network. It adopts a five-part architecture comprising the input layer, BiLSTM layer, flattening layer, fully connected layers, and prediction layer, where the input layer handles data pre-processing

The Word2vec model is trained on the pre-processed sequences, and the word vector of each pseudo-RNA word is obtained to form a dictionary. Each sub-sequence of a sequence is then replaced by its word vector from the dictionary, yielding a 999 × 100 feature matrix, which is the input to the BiLSTM layer. BiLSTM has the memory capacity to learn long-range context dependencies and extract global sequence features. To avoid overfitting, dropout [38] is applied in the BiLSTM layer with a drop probability of 0.2. The output is then flattened into one dimension and passed to fully connected layers for the final output. The fully connected block in Fig. 6 consists of three layers of 256, 128, and 64 neurons respectively, which increases the capacity of the model; additional fully connected layers improve its nonlinear expressive power and hence its learning ability. Each neuron is activated by the ReLU [39] function, and dropout is also applied with a probability of 0.5. Finally, the Sigmoid function [40] defined in (11) predicts the probability that an m6A site exists in the given sequence.

$$\hat{y}(x) = sigmoid(x) = \frac{1}{{1 + e^{ - x} }}$$
(11)
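A Keras sketch of this five-part predictor is given below. The layer sizes and dropout rates follow the description above; the number of LSTM units and the choice of optimizer are our assumptions, since the paper does not report them.

```python
from tensorflow.keras import layers, models

def build_word2vec_bilstm(seq_len=999, vec_dim=100, lstm_units=32):
    """Sketch of the Fig. 6 predictor: input -> BiLSTM (dropout 0.2)
    -> flatten -> dense 256/128/64 with ReLU and dropout 0.5
    -> sigmoid output (Eq. 11). lstm_units is an assumed value."""
    model = models.Sequential([
        layers.Input(shape=(seq_len, vec_dim)),   # 999 x 100 Word2vec matrix
        layers.Bidirectional(
            layers.LSTM(lstm_units, return_sequences=True, dropout=0.2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),    # m6A site probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

build_word2vec_bilstm().summary()
```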

m6A site prediction based on ensemble integration

As noted above, different feature encoding methods view the sequence from different perspectives. One-hot and RNA word embedding describe the specific information of the RNA modification site within the sequence window, while Word2vec uses an external neural network to thoroughly learn the semantic information in the sequence context. Different predictors may therefore contribute complementary strengths. Accordingly, an ensemble predictor named EDLm6APred is formulated from the One-hot, RNA word embedding, and Word2vec based BiLSTM predictors; its structure is shown in Fig. 7. With three predictors built on different encodings, it represents the sequences from more comprehensive perspectives. The weights of the three predictors are obtained by grid search.

Fig. 7

EDLm6APred model architecture. The figure shows the workflow of our method: the mRNA sequences are scored by three different deep learning classifiers, whose predictions are then combined by weighted voting into the final result
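A minimal sketch of the weighted integration follows. The paper states only that the weights are found by grid search; the grid step, the constraint that the weights are non-negative and sum to one, and the use of validation-set AUROC as the objective are our assumptions.

```python
import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

def grid_search_weights(p_onehot, p_embed, p_w2v, y_true, step=0.1):
    """Grid search for ensemble weights (w1, w2, w3), w_i >= 0 and
    sum(w) = 1, maximising AUROC of the weighted score on a
    validation set (assumed objective)."""
    best_auc, best_w = -1.0, None
    for w1, w2 in itertools.product(np.arange(0, 1 + 1e-9, step), repeat=2):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:
            continue   # weights must sum to 1 and stay non-negative
        score = w1 * p_onehot + w2 * p_embed + w3 * p_w2v
        auc = roc_auc_score(y_true, score)
        if auc > best_auc:
            best_auc, best_w = auc, (w1, w2, max(w3, 0.0))
    return best_w, best_auc

# Toy example with random scores; replace with the three predictors'
# validation-set probabilities in practice
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
p1, p2, p3 = (rng.random(200) for _ in range(3))
print(grid_search_weights(p1, p2, p3, y))
```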