Background

Over 170 chemical modifications have been discovered in both coding and non-coding RNAs to date [1,2,3]. 5-cytosine-methylation is one of the most common post-transcriptional modifications (PTCM) and has been found in almost all types of RNA [4, 5]. This modification can regulate nuclear mRNA output and RNA variable splicing, increase RNA stability, regulate protein translation and RNA–protein interaction, and maintain the normal structure of RNA [6,7,8,9,10,11,12,13,14]. Under the catalysis of RNA methyltransferase, methylation occurs on the carbon atom in the fifth position of a cytosine to generate 5-methylcytosine (m5C). Therefore, accurate identification of m5C sites in RNA is of great importance for understanding the mechanism and function of this modification.

Both experimental and computational methods have been developed to determine and predict m5C sites in RNA. Experimental methods such as bisulfite sequencing [5, 12], m5C-RIP [15], Aza-IP [16], mi-CLIP [17] and RBS-seq [18] have been somewhat successful in identification of m5C sites in RNAs of different species. However, these experimental methods are time-consuming and expensive, and they are not able to keep pace with the explosive increase of RNA sequences revealed by the rapid development of sequencing technology. Instead, computational methods can be able to provide a faster and more cost-effective way for m5C site identification.

So far, eight computational methods for predicting m5C site have been reported, which were summarized in Table 1 according to the datasets, algorithms, webservers, evaluation strategies and features employed. Feng et al. [19] built their model using a support vector machine based on PseKNC features extracted from RNA segments, and a balanced dataset with 120 m5C sites and 120 non-m5C sites was used to build this model. In addition, nine other datasets with 120 non-m5C sites were randomly selected to demonstrate their model is not sensitive to the selection of non-m5C sites. Later, Qiu et al. [20] have developed a model called iRNAm5C-PseDNC to predict m5C sites by using random forests. Differently, this model was built on an imbalanced and redundant dataset with 475 m5C sites and 1425 non-m5C sites. Then, by using ensemble learning methods, Zhang et al. [21] have developed a model called m5C-HPCR. A new heuristic algorithm was introduced to reduce the number of physical and chemical properties of nucleotides. The m5C-HPCR was benchmarked on both Feng et al.’s dataset and Qiu et al.’s dataset. Sabooh et al. [22] have developed a model by fusing composite encoding features including Di-Nucleotide Composition (DNC), Tri-Nucleotide Composition (TNC) and Tetra- Nucleotide Composition (TetraNC). The same dataset as that of Feng et al. [19] and Zhang et al. [21] was again used to build this model by using SVM. Recently, Fang et al. [23] compared the balanced dataset used in Feng et al.’s work and the imbalanced dataset used in Qiu et al.’s work, and developed a model named RNAm5CPred to predict m5C sites of H. sapiens. The model was built by SVM and an independent test set was used to evaluate different methods. A new predictor (PEA-m5C) developed by Song et al. [24] mainly focused on predicting m5C sites in A. thaliana. The model was built by using random forests on an imbalanced dataset but was tested on three balanced independent datasets. Li et al. [25] had collected data from GEO database and developed a web server RNAm5Cfinder based on random forest algorithm, which can be used to predict m5C sites in eight kinds of cells or tissues of mouse and human. All the m5C sites recorded in three GEO records and all other non-m5C sites in the genomes were collected to train their models, however, the redundancy of the datasets was not well dealt with. More recently, Lv et al. [26] developed a server called iRNA-m5C to predict m5C sites of four types of species. Their models are built with random forests with features of PseKNC, MNBE (mono-nucleotide binary encoding), KNFC(K-tuple nucleotide frequency component) and NV (natural vector).

Table 1 Summarization of the existing methods for predicting m5C sites of RNA

Although these reported methods performed well in the recognition of m5C sites in animal and plant RNA sequences, it is possible that the performance can be improved by introducing position specific related features such as position specific nucleotide propensity (PSNP), position specific dinucleotide propensity (PSDP). The effectiveness of these features has been proved in previous works [27, 28] for predicting m6A of RNA, however, the use of these features to predict m5C sites has not been explored in these methods mentioned above. It is expected that the performance of computational methods can be further improved by mining position specific related features and composition related features.

In this study, we have developed a new method, m5CPred-SVM, to predict m5C sites in RNA sequences of three different species, H. sapiens, M. musculus and A. thaliana. First, we generated six kinds of features based on RNA sequences, namely k-nucleotide frequency (KNF), pseudo dinucleotide composition (pseDNC), K-spaced nucleotide pair frequency (KSNPF), position-specific nucleotide propensity (PSNP), K-spaced position-specific dinucleotide propensity (KSPSDP) and Chemical Property with Density (CPD). Then, the sequential forward feature selection strategy was used to select a compact feature subset from these six kinds of features. Based on this selected feature subset, our method was built using a support vector machine (SVM). At last, the performance of our method was compared with several existing methods. The results showed that our method can offer substantially better performance than these existing methods on the independent test sets.

Results

Performance of each type of feature

By using SVM over the ten folds cross-validation, we have evaluated the performances of the six types of extracted features for the three species, namely H. sapiens, M. musculus and A. thaliana. As shown in Table 2, PSNP, KSPSDP, CPD are the three features showing the best performances among the six types of features for H. sapiens. The cross-validation AUROCs for these three features are 0.879, 0.862 and 0.850, respectively. Table 3 shows that CPD, KSPSDP and PSNP are the three features providing the best performances for M. musculus. The cross-validation AUROCs are 0.812, 0.803 and 0.794 for these three features, respectively. As for A. thaliana, Table 4 shows that the top three models with the best performances were based on PseDNC, 4NF and CPD, and the corresponding cross-validation AUROCs are 0.760, 0.759 and 0.753. The ROC curves of the six types of features for H. sapiens, M. musculus and A. thaliana are shown in Fig. 1.

Table 2 The results of feature selection for H. sapiens
Table 3 The results of feature selection for M. musculus
Table 4 The results of feature selection for A. thaliana
Fig. 1
figure 1

The ROC curves that show the performances of the six type of features for H. sapiens, M. musculus and A. thaliana, respectively. a H. sapiens; b M. musculus; c A. thaliana

Feature subsets selected by SFS

Considering the fact that different features may be complementary, combination of the six generated features may improve the predictive performance. However, there are also redundancy between these features, and a high dimensional input feature can make the model training very time-consuming and easily over-fitting. In order to solve the problem, we have used the sequential forward feature selection (SFS) strategy to select a compact feature subset from these features to build our final models.

As shown in Table 2, the cross validation accuracy was convergent at the fourth round in the SFS process for training the model of H. sapiens. The highest AUROC is 0.899, and the corresponding feature subset includes PSNP, 4NF, 5SNPF and PseDNC.

For M. musculus, the cross validation accuracy was convergent at the third round in the SFS process (Table 3). The highest AUROC is 0.822 and the corresponding feature subset includes CPD, 4NF and 1SPSDP.

As for A. thaliana, Table 4 shows that the cross validation accuracy is convergent at the fifth round in the SFS process. The highest AUROC is 0.782 and the corresponding feature subset includes PseDNC, 3SPSDP, 4NF, 1SNPF and PSNP.

Model sensitivity to the selection of negative samples

To evaluate if the selection of the negative samples affects the predictive performances of the models, we built other nine models based on positive samples and other nine negative subsets for both H. sapiens and M. musculus with the optimal feature subsets. Additional file 1: Tables S1 and S2 show cross-validation performances of the 10 models built on the positive samples and the 10 negative subsets for H. sapiens and M. musculus, respectively. The means and the standard errors of the ROC AUCs of the ten models are 0.843 and 0.029, and 0.822 and 0.003, for H. sapiens and M. musculus, respectively. Additional file 1: Tables S3 and S4 show the performances of the ten models on the independent test sets of H. sapiens and M. musculus, respectively. The means and the standard errors of the ROC AUCs of the ten models are 0.834 and 0.024, and 0.776 and 0.007, for H. sapiens and M. musculus, respectively. The results indicate the performance of the models is affected a little for H. sapiens by the selection of negative samples, however, the performance is barely affected for M. musculus by the selection of negative samples. The main possible reason is that the dataset for H. sapiens is smaller compared with that of M. musculus. The distribution is easily fluctuated for small samples.

Comparison with other classifiers

Studies above has showed that support vector machine performed well in predicting m5C sites for different species. In order to further investigate and compare the performance of other classifiers, we used other five classifiers, namely KNN [29], Adaboost [30], random forests [31], decision tree [32], logistic regression [33] and XGBoost [34] to build models based on the selected feature subsets for all the three species. The hyper parameters for KNN, Adaboost, random forests and XGboost were also optimized with grid search. The \(k\) of KNN is set from 1 to 10 with a step 1. The ntree of 10 to 1000 with a step 20 is set for both Adaboost and random forests. The learning rate, max depth and nrounds of XGboost are set between 2−4 and 2−1, 2 and 10, and 23 and 210, respectively. Table 5 shows the cross validation results of the six classifiers. For H. sapiens, the AUCROC value of SVM is 0.899, which is higher than those of XGBoost, RF, KNN, AdaBoost, decision tree and LR at 0.020, 0.050, 0.049, 0.039, 0.221 and 0.282, respectively. Moreover, the SVM model achieved the highest values in all other metrics. For M. musculus, the AUCROC value of SVM is 0.822, which is again higher than those of RF, KNN, AdaBoost, decision tree and LR at 0.008, 0.093, 0.010, 0.207 and 0.011, respectively, but a little bit less than XGBoost (0.823). For A. thaliana, SVM again gave the highest AUC value at 0.782, which is higher than those of XGBoost, RF, KNN, AdaBoost, decision tree and LR for 0.012, 0.004, 0.048, 0.026, 0.195 and 0.052, respectively. For other metrics, KNN has the highest Sp value and RF has the highest Pre value, while the Sn, Acc, MCC and F1 score value of SVM offered the highest values for all the remaining 4 metrics. The fact that SVM has outperformed all other six classifiers for both H. sapiens and A. thaliana and is comparable to XGBoost for M. musculus further confirms that SVM is a stable and robust classifier. As a result, SVM was selected as the final classifier in this study.

Table 5 Comparison of different classifiers based on the cross-validation results on the training datasets for the three species

Comparison with other existing methods

In this study, we have also compared our methods with some other existing m5C site prediction methods [19,20,21,22,23,24,25,26]. Because different benchmark datasets have been used for building different methods, independent test sets were used to ensure the objectiveness of the comparison. These independent test sets were only used for comparison and not for building our models. At present, four methods are available to identify m5C sites of H. sapiens, namely RNAm5Cfinder [25], iRNA-m5C [26], iRNAm5C-PseDNC [20] and RNAm5CPred [23]. Two methods are available for predicting m5C sites of M. musculus, namely RNAm5Cfinder [25] and iRNA-m5C [26]. Two methods are available for detecting m5C sites of A. thaliana, namely PEA-m5C [24] and iRNA-m5C. Table 6 shows the predictive results of these methods on the independent test sets for the three species, and Fig. 2 shows the relevant ROC curves and PRC curves. For H. sapiens, iRNAm5C-PseDNC has the highest Sp value (0.971), while our method gives significantly higher values for Sn, Pre, Acc, MCC, F1 score and AUROC when compared with other methods. For M. musculus, other than Sp and Pre, again our method has the highest values for all the remaining metrics (Sn, Pre, Acc, MCC, F1 score and AUROC). For A. thaliana, our method gives the highest value for all the metrics. All these results have indicated that our methods performed better than other existing methods in predicting m5C sites.

Table 6 Comparison with existing methods on the independent test sets
Fig. 2
figure 2

ROC curves and PRC curves for our model and other models on the independent test sets. Upper panel: H. sapiens; Lower panel: M. musculus

Cross-species verification

In this study, models were built for H. sapiens, M. musculus and A. thaliana individually. It will be of great interest to evaluate the species-specificity and transferability of these models using the cross-species verification. To achieve that, the three models built on the three species-specific m5C training data sets were further tested on the three independent test datasets. Figure 3 shows the test results. Firstly, all the three models performed well on its own independent test sets (see the diagonal of Fig. 3). Secondly, the models of H. sapiens and M. musculus both performed poorly on the independent test set of A. thaliana, and vice versa. One possible reason to explain this result is because H. sapiens and M. musculus are both mammals while A. thaliana is plant. This reason is supported by Fig. 4, which shows that the nucleotide distribution in the sequence of A. thaliana is different from that of H. sapiens and M. musculus. Thirdly, the M. musculus-specific model performed well on the H. sapiens-specific independent test set, however, the H. sapiens-specific model performed poorly on the M. musculus-specific independent test set. It might be because the H. sapiens-specific datasets have smaller sizes than the M. musculus-specific datasets, and the smaller datasets sizes limit the variety of sequences which confines the transferability of the H. sapiens-specific model.

Fig. 3
figure 3

The heat map showing the cross species prediction accuracies. Once a species-specific model was established on its own training dataset in rows, it was validated on the data from the same species as well as the independent data from the other two species in columns

Fig. 4
figure 4

The nucleotide distribution around m5C and non-m5C sites. The top panel of the x-axis is for m5C site containing sequence, while the down panel of the x-axis is for non-m5C site containing sequences. a H. sapiens; b M. musculus; c A. thaliana

Web implementation

For the convenience of researchers, a user-friendly and publicly accessible web server was built to implement our method, which is available at https://zhulab.ahu.edu.cn/m5CPred-SVM. Users can predict the m5C sites on this server without complicated calculation. The detailed procedure to use the web server is as below:

To start with, users need to choose one from the three species, H. sapiens, M. musculus and A. thaliana. After that, users can type the query RNA sequences into the input box or upload a FASTA format file (Note that the input sequence should be in FASTA format and the length of each query sequence should be longer than 41 bp). Then, by clicking the 'submit' button, the system will do the calculation and give the final result. In the backend, the server would find the cytosine in the query sequence. All cytosine-centric RNA fragment would be extracted with flank size equals to 20 and the missing nucleotides would be filled by ‘N’. There might be lots of cytosines in a sequence, and our predictive model will reconstruct the sequence separately for each of them. The server home page also contains our contact information for users to contact us in case they have problems with the server or have suggestions.

Discussion

Our study shows that the position specific related features can be effective features for discriminating m5C sites from non-m5C sites. Theoretically speaking, the difference of nucleotide distribution between RNA sequences containing m5C sites and those without m5C sites determines how well we can discriminate them. In other words, the nucleotide distribution around the m5C site may have a certain preference. In order to investigate the nucleotide distribution preference for each sequence position, we adopted Two Sample Logo tool [35] to conduct visualization of the nucleotide site preference around m5C and non-m5C sites in the three species. Figure 4 clearly shows that significant difference does exist in nucleotide distribution around the m5C sites and the non-m5C sites for these three species, and the difference was found to descend in the sequence of H. sapiens, M. musculus and A. thaliana according to the depleted ratio (see Y axis of Fig. 4). It is shown that the depleted ratio of H. sapiens is from −28.6 to 28.6% and the depleted ratio of M. musculus is from −24.2 to 24.2%, which means the differences of nucleotide position preferences between positive and negative samples of the two species are significantly different. However, the corresponding depleted ratio of A. thaliana is from −11.9 to 11.9%. This is in line with our results that the six types of features for H. sapiens performed better than those for M. musculus, and the features for A. thaliana performed worst. The sequence differences observed here may account for the performance difference of the six types features observed before for the three species.

In addition, this figure can also explain why PSNP, KSPSDP performed well among the six types of features for both H. sapiens and M. musculus, while PseDNC and 4NF achieved best accuracy for A. thaliana. Among these six types of features, PSNP and KSPSDP are the two features that consider position preference information. As mentioned before, both H. sapiens and M. musculus have high position preferences of nucleotide in RNA sequences, thus it is not surprising that PSNP and KSPSDP performed best for these two species. On the contrary, the position preferences of RNA sequences of A. thaliana are not as significant as those of H. sapiens and M. musculus, so the two features, PSNP and KSPSDP, did not performed as well as they did for H. sapiens and M. musculus.

KNF, KSNPF and pseDNC are three features related to nucleotide composition of RNA segments. KNF can describe the local sequence-order information of nucleotide sequences. The idea of KSNPF is to calculate the frequency of sixteen pairs of nucleotides spaced by K-length polynucleotides. With increasing of K, KSNPF feature takes the position correlation information into account within the nucleotide sequence. PseDNC feature contains both local and global sequence-order information. The performances of these three features are determined by the composition difference between positive and negative samples.

The CPD feature contains nucleotide information at each position of the RNA segments and it also contains the nucleotide composition information along the RNA sequences, so that it performs well for all these three species.

According to our results, models based on these selected feature subsets selected by SFS had made improvements of about two percents in performance when compared with models based on single feature. As all of these selected feature subsets are combinations of the position specific features and the composition features, the improvements observed here can further confirm the complementarity between these two groups of features. It should be noted that we tried a large number of other types of features generated by iLearn [36] or Pse-in-One [37] toolkits when we designed the input features (data not shown). The sequence-based features generated by these two toolkits have been used widely for predicting both RNA post-transcriptional modification sites [38,39,40] and post-translational modification sites [41, 42]. Our experimental results demonstrated that our proposed feature combination in this study yielded satisfactory performance, which cannot be significantly improved when they were combined with other features.

We summarized the possible reasons for our method to outperform other existing methods. For the benchmark datasets, we used larger training sets for H. sapiens and M. musculus than iRNA-m5C which is the latest model for multiple species. Large datasets are helpful for improvement of the generalization of models. In addition, we added two types of position specific propensity features, PSNP and KSPSDP. Our results (Tables 2, 3 and 4) demonstrate PSNP and KSPSDP have played key roles in improving method performance.

Conclusion

In this study, a new computational method, m5CPred-SVM, was developed for predicting m5C sites in RNA sequences. Non-redundant large benchmark datasets were collected for three species, namely H. sapiens, M. musculus and A. thaliana. A total of six types of features, including features related to composition, features related to position specific and features related to physicochemical properties were used in building our models. Results have showed that the features related to position specific are effective in differentiating m5C sites from non-m5C sites for H. sapiens and M. musculus. Nucleotide distribution analysis reveals that nucleotide position preferences are significant for both H. sapiens and M. musculus, which account for the effectiveness of the features related to position specific propensity. For the same reason, the features related to position specific propensity are not that effective for A. thaliana because the nucleotide position preferences are less significant compared with that for the other two species. Optimal feature subsets were selected from these six types of features using the sequential forward feature selection strategy. All the three subsets consisted of feature related position specific propensity and feature related to nucleotide composition which indicate the complementarity between the features. The performance of our method m5CPred-SVM was objectively compared with other existing methods by using independent test sets. The results showed that our method can offer significantly better performances than all the other existing methods. Finally, a web server was built at https://zhulab.ahu.edu.cn/m5CPred-SVM to facilitate the access to our method by academic users to predict the m5C sites in RNA sequences.

Methods

Benchmark datasets

High quality benchmark datasets are extremely important for training and evaluating machine learning models. In this study, m5C data of three species have been collected from recently published literature. For A. thaliana, same datasets constructed by Lv et al. [26] were used for fair comparison. The positive RNA segments which contain m5C site in the center were collected from NCBI Gene Expression Omnibus (GEO) database with the accession number GSE94065 [43]. This dataset contains 6289 positive samples and 6289 negative samples.

The positive samples in the datasets of M. musculus and H. sapiens were obtained from the works of Yang et al. [44] and Vahid Khoddami et al. [18], respectively. For H. sapiens, we collected the data from the work of Vahid Khoddami et al. [18]. The file “GSE90963_Table_S1-m5C_candidate_sites.xlsx” was downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90963), which recorded both m5C sites information by their RBS-seq work and the m5C sites information in other public datasets. Firstly, we collected the sites with high-threshold in their RBS-seq work. Secondly, we collected the sites both reported in their RBS-seq work and the public datasets. Totally, 408 m5C sites were collected for H. sapiens. For M. musculus, we collected the data from Additional file 1: Table S3 of Yang et al.’s work [44]. The m5C sites detected in six different tissues are all considered as positive examples, thus we obtained 13042 RNA segments centered with m5C. In order to avoid bias of the datasets, similar sequences in the datasets were removed using the CD-HIT program [45] with the sequence identity threshold set at 70%, through which we have obtained 5563 and 269 positive samples for M. musculus and H. sapiens, respectively. In machine learning, the model performance may be degraded and the prediction results may be out of balance due to the inconsistency of the amount of data between the positive sample and the negative sample [46, 47]. Therefore, we have randomly selected the same number of negative samples as that of positive samples for the establishment of the benchmark dataset. It is worth noting that the redundancy of the negative examples was also removed using CD-HIT with the sequence identity threshold set at 70%. To verify if the model is sensitive to the selection of negative samples, we conducted the same procedure to generate other nine negative subsets for both H. sapiens and M. musculus. We have not done the same thing for A. thaliana because we did not know the details about the generation of negative samples of A. thaliana which were obtained from Lv et al.’s work.

The benchmark dataset is usually divided into two parts. One is the training dataset and the other is the independent test set. The training dataset is used for model construction, cross-validation and determination of hyper-parameters of the learning algorithms. The independent test dataset is used to test the performance and generalization ability of the model. In this study, 69 positive samples and 69 negative samples were randomly selected as the independent test dataset and the remaining 200 positive samples and 200 negative samples were used as the training dataset for H. sapiens. For M. musculus and A. thaliana, 1000 positive samples and 1000 negative samples were randomly selected as the independent test datasets, and the remaining samples (4563 positive samples and 4563 negative samples for M. musculus, 5298 positive samples and 5298 negative samples for A. thaliana) were used as the training datasets.

The fragment of each RNA in the datasets is represented as:

$$R_{\lambda } \left( C \right) = N_{ - \lambda } N_{{ - \left( {\lambda - 1} \right)}} \ldots N_{ - 1} CN_{1} \ldots N_{{ + \left( {\lambda - 1} \right)}} N_{\lambda }$$
(1)

where \(N_{ - \lambda }\) represents the upstream nucleotide of central cytosine and \(N_{\lambda }\). rresents the downstream nucleotide of central cytosine. In most previous works [19,20,21,22, 26], the length of the input RNA segments was set to 41 and the m5C site is located in the central position 21. In this study, we have also extracted features from the 41 bp long RNA segments.

The details of the training dataset and the testing dataset are shown in Table 7.

Table 7 The information of the datasets

Feature extraction

K-nucleotide frequency (KNF)

As a classic sequence coding feature, K-nucleotide frequency (KNF, also called NC (Nucleotide composition)) has been widely used to build bioinformatics models [48,49,50]. Suppose we have an RNA segment R of length L:

$$R = n_{1} n_{2} n_{3} \ldots n_{i} \ldots n_{L - 1} n_{L}$$
(2)

\(n_{i}\) indicates the ith nucleotide of R, and it can be any one of the four nucleotide bases in RNA, i.e.\(n_{i}\) \(\in\){A, C, G, U}. For a given K value, KNF represents the frequency of occurrence of each K-mer nucleotide component in the nucleotide sequence. It can be calculated by the formula (3).

$$f\left( {n_{1} n_{2} \ldots n_{k} } \right) = \frac{{N\left( {n_{1} n_{2} \ldots n_{k} } \right)}}{L - K + 1}$$
(3)

where \(n_{1} n_{2} \ldots n_{K}\) indicates a K-mer nucleotide component. It is not difficult to find that the K-mer nucleotide composition of an RNA sequence is a \(4^{K}\)-dimensional vector consisting of frequency of each K-mer type. As the value of K increases, the dimension of the feature vector increases exponentially. For example, when K = 1, four types of single nucleotide frequencies can be obtained. We chose the K value of 4 to calculate the frequency at which 4 nucleotides appears (4NF) according to a previous work [23]. The RNA fragment can be encoded as:

$$R\left( {4NF} \right) = \left[ {f_{AAAA} f_{AAAC} \ldots f_{GCUU} \ldots f_{UUUG} f_{UUUU} } \right]$$
(4)

K-spaced nucleotide pair frequency (KSNPF)

K-spaced nucleotide pair frequency is another method for encoding RNA sequences [51]. This method mainly calculates the frequency of 16 pairs of nucleotides separated by k-length polynucleotides. We use \(n_{1}\) × {K}\(n_{2}\) to represent K-spaced nucleotide pairs. Since \(n_{1}\) and \({ }n_{2} { }\) have four possible values, so there are sixteen (\(4^{2}\) = 16) possible combinations. For example: AxxC is a two spacer nucleotide pair. The calculation formula of KSNPF is

$$f\left( {n_{1} {\text{x}}\left\{ {\text{K}} \right\}n_{2} } \right) = \frac{{N\left( {n_{1} {\text{x}}\left\{ {\text{K}} \right\}n_{2} } \right)}}{L - K + 1}$$
(5)

In this work, we tried different K values in order to determine the best KSNPF features for different species. The selection of K for different species can be found in Additional file 1: Table S5.

Position-specific nucleotide propensity (PSNP)

In several previous works [27, 28, 51], position-specific nucleotide propensity has been used to predict the post-transcriptional modification of RNA. This feature is obtained by calculating the difference in nucleotide frequencies at specific positions between positive and negative RNA fragments. It was first introduced in Li et al.’s work [28]. According to Eq. (1), the RNA fragment can be re-expressed as:

$$R_{\lambda } = N_{1} N_{2} \ldots N_{i} \ldots N_{2\lambda + 1}$$
(6)

First, we calculated the frequency of the four nucleotides at the i-th position in the positive sample and the negative sample, respectively. After that, the 4-dimensional positive vectors and the 4-dimensional negative vectors were combined individually to obtain two \(4 \times \left( {2\lambda + 1} \right)\) position-specific occurrence frequency matrices for positive and negative samples, respectively. The two matrices were named as \(M^{ + }\) and \(M^{ - }\),\(M^{ + }\) is for positive samples and \(M^{ - }\) is for negative samples. Through \(M^{ + }\) and \(M^{ - }\),we defined the position-specific nucleotide propensity matrix, denoted as \(X_{PSNP}\), as below:

$$X_{PSNP} = M_{ + } - M_{ - }$$
(7)

K-spaced position-specific dinucleotide propensity (KSPSDP)

Position-specific dinucleotide propensity is defined using the similar procedure to define PSNP. To calculate this feature, we rewrite Eq. (6) as a dinucleotide:

$$R_{\lambda } = D_{1} D_{2} \ldots D_{i} \ldots D_{2\lambda }$$
(8)

where Di represents the dinucleotide at the i-th position of RNA and has 16 types of values. By using the similar way for calculating the PSNP feature, we can get the (\(16 \times 2\lambda\)) position-specific dinucleotide propensity (PSDP) matrix.

To calculate K-spaced position-specific dinucleotide propensity, \(n_{1}\) × {K}\(n_{2}\) was used to represent K-spaced nucleotide pairs. PSDP is a specific case for KSPSDP when K equals 0. In this work, we tried different K values to determine the best KSPSDP features for different species. The selection of K values for different species can be found in Additional file 1: Table S6.

Pseudo dinucleotide composition (PseDNC)

The pseudo K-tuple nucleotide composition (PseKNC) has been used to represent an RNA sequence with a discrete model or vector which can keep considerable sequence order information, especially the global or long-range sequence order information [20, 26, 52, 53]. In this study, we used PseDNC (K = 2 for PseKNC) to encode the RNA segments. Three physicochemical properties, free energy, hydrophilicity and stacking energy were used to generate features of PseDNC. The values of these three physicochemical properties of 16 dinucleotides are shown in Table 8.

Table 8 Three types of physicochemical properties of dinucleotides in RNA

Chemical property with density (CPD)

The four types of nucleotides in RNA (A (adenine), U (uracil), G(guanine) and C(cytosine)) can be divided into three categories according to their chemical structures and internal binding characteristics [54]. Considering the ring structure of the nucleotide, C and U are pyrimidines with one ring, while A and G are purines with two rings. As for the secondary structure, the hydrogen bonds of A and U are weak, while the hydrogen bonds of G and C are strong. In terms of chemical functionality, U and G are classified as keto groups, while A and C are in amino groups. These three aspects of chemical properties can be represented as a three-dimensional vector (x, y, z), where x, y, z represent the ring structure, the hydrogen bond, and the chemical functionality of the nucleotides respectively. In this way, each nucleotide \(n_{i} = \left( {x_{i} ,y_{i} ,z_{i} } \right)\) in an RNA sequence can be encoded as:

$$x_{i} = \left\{ {\begin{array}{*{20}c} {1 if n_{i} \varepsilon \left\{ {A,} \right.\left. G \right\}} \\ {0 if n_{i} \varepsilon \left\{ {U,} \right.\left. C \right\}} \\ \end{array} } \right., y_{i} = \left\{ {\begin{array}{*{20}c} {1 if n_{i} \varepsilon \left\{ {A,} \right.\left. C \right\}} \\ {0 if n_{i} \varepsilon \left\{ {U,} \right.\left. G \right\}} \\ \end{array} } \right., z_{i} = \left\{ {\begin{array}{*{20}c} {1 if n_{i} \varepsilon \left\{ {A,} \right.\left. U \right\}} \\ {0 if n_{i} \varepsilon \left\{ {C,} \right.\left. G \right\}} \\ \end{array} } \right.$$
(9)

Thus, the four types of nucleotide, A, U, G and C, can be encoded as (1,1,1), (0,0,1), (0,1,0), (1,0,0), respectively.

In order to better represent the distribution of each nucleotide in the RNA sequence, the density of a nucleotide, which describes the frequency of the nucleotide occurring before current position, is denoted as:

$$d_{i} = \frac{1}{{\left| {N_{i} } \right|}}\mathop \sum \limits_{j = 1}^{i} f\left( {n_{j} } \right), f\left( {n_{j} } \right) = \left\{ {\begin{array}{*{20}c} {1 \;if\; n_{j} = p} \\ {0 \;if\; n_{i} \ne p} \\ \end{array} } \right.$$
(10)

where di is the density of nucleotide, i is the current position of RNA sequence, \(\left| {N_{i} } \right|\) is the length of the ith prefix string \(\left\{ {n_{1} ,n_{2} , \ldots ,n_{i} } \right\}\) in the sequence, and \(p\) is the symbol of {A, U, G, C}.

By integrating the nucleotide chemical property and the distribution of each nucleotide in the RNA sequence, a (\(4 \times \xi\))-dimensional CPD feature vector can be generated, where \(\xi\) is the length of the RNA segment.

Support vector machine

Support vector machine (SVM) is a popular statistical learning method and has been extensively used to build bioinformatics models [23, 50, 55,56,57,58] because of its high efficiency and robust output. In this study, we used the MATLAB function FITCSVM to build our models. SVM uses kernel functions to project low-dimensional data into high-dimensional space. A few different kernel functions can be used in training. In this work, the radial basis kernel function was selected with two hyper parameters (box constraint and kernel scale) to be used with FITCSVM function. The two parameters were optimized by a grid search with box constraint from \(2^{ - 5}\) to \(2^{15}\) and kernel scale from \(2^{ - 10}\) to \(2^{6}\).

Evaluation criteria

Ten-fold cross-validation was used to evaluate the generalization performance based on the training dataset. For the ten-fold cross-validation, the training dataset was divided into ten roughly equal-sized subsets with a stratified sampling, and then one subset was used as a validation set whereas the remaining nine subsets were combined for training. This process was repeated ten times with ten models built and validated. Finally, the average performance was obtained. In this study, the ten-fold cross-validation was used for feature selection, parameter optimization and classifier comparison.

Different metrics were used to assess the model performance, namely accuracy (Acc), sensitivity (Sen), specificity (Spe), precision (Pre), Matthews correlation coefficient (Mcc) and F1-score. The specific formulas are as below:

$$\left\{ {\begin{array}{*{20}c} {{\text{Sen}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}} \\ {{\text{Spe}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}}} \\ {{\text{Pre}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}} \\ {{\text{Acc}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}} + {\text{FN}}}}} \\ {{\text{Mcc}} = \frac{{{\text{TP }} \times {\text{*TN}} - {\text{FP* FN}}}}{{\sqrt {\left( {{\text{TP}} + {\text{FP}}} \right)\left( {{\text{TP}} + {\text{FN}}} \right)\left( {{\text{TN}} + {\text{FP}}} \right)\left( {{\text{TN}} + {\text{FN}}} \right)} }}} \\ {{\text{F}}1 = \frac{2*TP}{{\left( {2*TP + FP + FN} \right)}}} \\ \end{array} } \right.$$
(11)

where, TP, TN, FP and FN represent the number of true-positive (m5C sites that were predicted as m5C sites), true-negative (non-m5C sites that were predicted as non-m5C sites), false-positive (non-m5C sites that were predicted as m5C sites) and false-negative (m5C sites that were predicted as non-m5C sites) samples, respectively.

In addition, we draw the receiver operating characteristic curve (ROC curve) [59] and precision recall curve (PRC curve) [60], to evaluate the performances of different models. ROC curve demonstrates the relationship between sensitivity and 1-specificity at different thresholds, and PRC curve reflects the trend of precision changing with recall. These two curves can be used to evaluate the predictive capability of the proposed method across entire range of decision values. The areas under these two curves (AUROC and AUPRC) were also calculated to quantify the model performance. AUROC and AUPRC have value ranging from 0 to 1. The closer the value approximate 1, the better the model performance is.

Feature selection

There are three major methods for feature selection: Filter, Wrapper and Embedded. We have chosen the sequence forward selection algorithm (SFS) under Wrapper as the feature selection algorithm in this study. Six types of features are generated and constitute the high-dimensional feature vector of each sample. The following specific operations of SFS were used to achieve a compact and efficient feature subset: in the first round, the ten-fold cross-validation results were obtained for models built on each of the six types of features. The best performing feature type was selected according to the AUROC value and then proceeded to the next round of calculation. In the second round, the remaining five types of features were added to the best performing feature type selected in the first round. Similarly, the best performing feature combination was again selected according to the AUROC value and proceeded to the next round of calculation. This process continued until AUROC converged. The subset with the highest AUROC value was considered as the optimal feature subset.

The entire procedure of m5CPred-SVM is illustrated in Fig. 5.

Fig. 5
figure 5

The flowchart of m5CPred-SVM