Background

Among more than 140 kinds of post-transcription modifications (PTMs) [1, 2], N6-methylation (m6A)—the methylation at 6th nitrogen of adenosine, is one of the most abundant modifications [3, 4]. This methylation has been widely found in species such as Arabidopsis thaliana, Saccharomyces cerevisiae, bacteria, virus, human, and mouse [58]. More exactly, these methylation events have occurred in the mRNAs at the 3’ untranslated regions (UTRs) close to the stop codon, following a conserved sequence motif DRACH, [G/A/C] [G/A] A* C [U/A/C], (where A* stands for the m6A site) [9, 10]. The dynamic m6A methylation involves many proteins such as METTL3, METTL14, WTAP, ALKBH5 and YTHDF2 [3, 1113]. With intensive investigation on this dynamic and reversible methylation in mRNAs recently, the functions of m6A in biological processes have been significantly redefined. It is reported that m6A disruption can effect translation efficiency [14], cell viability [15] and cell development [11]. The level changes of m6A in mRNA can lead to abnormality of RNA export, protein translation or RNA editing, causing cancer, obesity, and other human diseases [1619]. For example, strong relationships have been observed between m6A and HIV-1 [20, 21], Zika virus infection [22] and breast cancer stem cell phenotype [23]. The identification of m6A sites is crucial for understanding the disease mechanisms and identifying novel medicine targets.

Experimental approaches including two-dimensional thin layer chromatography [24], high performance liquid chromatography [25], and high-throughput methods (e.g., m6A-seq [9] and MeRIP-Seq [10]) have been applied to identify m6A sites in mRNAs. However, they can only detect m6A-containing transcript fragments instead of identifying the exact methylated adenines [26]. Based on the single-nucleotide resolution m6A maps in mRNAs, researchers have explored computational methods with sequence features and machine learning algorithms to make m6A sites prediction. For instance, iRNA-Methyl [27], m6Apred [6] and RAM-ESVM [28] are predictors aiming at yeast m6A site prediction [29]; methods SRAMP [30], Methy-RNA [31] and RAM-NPPS [32] are built on human and mouse m6A maps [33, 34]. There are also some predictors developed for Arabidopsis thaliana [3537].

One critical issue of this challenging prediction problem is that non-m6A sites are much more than m6A sites in the training data. The existing computational methods have overlooked this imbalance issue. In fact, they trained the model with balanced datasets containing roughly equal sizes of m6A samples and randomly selected non-m6A samples. Such sampling of non-m6A samples may lead to inadequate learning and the prediction models would change when the selected non-m6A samples are different.

Here we use a cost-sensitive XGboost classifier to address the imbalance issue. Similarly as previous works, m6A samples and non-m6A samples are labeled as positive and negative respectively. The classifier is then trained with all the samples without selecting a subset of negative samples and prevents over-fitting by defining different costs for the misclassified positive and negative samples. In the learning stage, the model minimizes the cost function and improves the precision of classifying positive samples. Besides, ROC rather than accuracy is set as the training cost function. Owing to training on the whole dataset without sampling noise, our method which is called HMpre, exhibits higher performance and better robustness.

Another issue of computational m6A prediction is the lack of valid features. The state-of-the-art features are usually derived from window sequences with m6A at the centre position. These features include binary encoding sequence features [30, 35], k-mers [35], physical-chemical properties [38, 39], position-specific nucleotide propensities [40], pseudo nucleotide compositions [28, 41, 42], nucleotide pair spectrums [30] and multi-internal nucleotide pair positions [32].

To improve the effectiveness of feature space, we present three types of novel m6A features. First, we extract novel features to capture specific single nucleotide polymorphism (SNP) variants in the window sequences through the MRMR method and Fisher’s exact test [43]. These features are relevant because single nucleotide variants can effect m6A dynamics [44]. Moreover, m6A occurs richly in some particular regions of transcripts, thus we calculate the absolute and relative locations of m6A sites as new features. To further exploit the distribution properties of nucleotides, entropy information is also considered as new features. Together with these newly proposed features, conventional features including 4-bit binary, overlapping chemical property with density and k-mers are integrated into our feature space to describe comprehensive characteristics of methylation.

In the performance evaluation of our method HMpre, we first report specific SNP positions as new features. Then we report a detailed comparison result with three existing balance learning predictors on an independent test dataset. HMpre achieves a much better performance of precision 0.3035, F1 0.3961 and MCC 0.3329. Since the ratio of positive sites over negative sites in a test mRNA is unknown, HMpre and existing predictors are also evaluated on 9 datasets containing different ratios of positive sites over negative sites. Results show that HMpre works better and has stronger robustness on the ratio change. In practical use, the inputs to a predictor are always individual transcripts, therefore the four methods are then applied to make predictions on single transcripts. Again, HMpre achieves the best overall performance. Furthermore, we evaluate the features effectiveness with 10-fold cross validation and feature importance scores from XGBoost classifier. The new features are all meaningful and the proposed feature space improves performances notably. In the case studies, the transcript of c-Jun gene is taken as an example to demonstrate the prediction details. Then we evaluate our method on the transcript of CBFB gene relating to HIV-1 infection and our method also achieves better results than the other predictors.

Methods

Datasets

Currently validated human mRNA m6A sites were all obtained by Ke and Linda from single nucleotide resolution maps [33, 34]. To guarantee the reliability of negative samples, non-m6A sites conforming to the conserved motif DRACH were all produced from these validated transcripts. Based on these datasets, Zhou has built a human mature mRNA m6A dataset which is the largest human m6A dataset so far. The dataset used in our experiments is downloaded from Zhou’s work [30]. After removing redundant and unaligned samples, we get 7506 human mature transcripts in total. We reserved 6280 transcripts for training and 1226 transcripts for independent testing. For each transcript, the number of non-m6A conforming to the DRACH motif is much larger than m6A sites. The training dataset contains 26512 positive samples and 271214 negative samples, while the independent test dataset contains 5644 positive samples and 54744 negative samples. Each sample contains the transcript id, the location of target adenine and the flanking window sequence. All samples used in our dataset are listed in Additional file 1.

Feature space construction

Computational prediction methods usually build features from a flanking window sequence with m6A at the centre position. The size of the flanking window varies from 20 to 50 nts in previous works and we choose the size of 25-nt which is similar to other human predictors. Thus the features are extracted from the 51-nt long sequence. Based on the sequence characteristics of m6A site, we introduce three types of new features: site location related features, features related to entropy information and SNP features. Three types of conventional features are also used. There are totally 509 dimensions in our feature space. The transcript sequences, length information (including coding region and UTRs) and SNP variants are obtained from the Ensembl online human gene database (GRCh38.p10). A diagram of the feature space construction is presented in Fig. 1.

Fig. 1
figure 1

Feature Space Construction

Three new types of sequence features

Site Location Related Features In mature transcripts, m6A sites are rich in some special regions, such as the 3’ UTRs near the stop codon [10]. However, non-m6A sites conforming to the DRACH motif are randomly distributed over the entire transcript. Thus the location of target adenine site in the transcript can be taken as a new feature. Specifically, site location refers to the distance between the target site and the transcript start site. Beside, the relative location of the target site in the whole transcript is also taken as a new feature, which is the ratio of the site location over the transcript length.

Features Related to Entropy Information Because of motif conservation for regulating protein binding sites, the nucleotides around m6A sites have some unique distributions. Shannon information theory can be used to evaluate these nucleotide distributions in the transcript fragment sequences. We calculate Shannon entropy (En), relative entropy (REn) and information gain score (IGS) of all samples as a new type of feature. The scores of these features are calculated as:

$$ En(s)=-\sum\limits_{i \in \{A, G, U, C\}}p_{i}^{s}\log_{2}(p_{i}^{s}) $$
(1)
$$ REn(s)=-\sum\limits_{i\in\{A, G, U, C\}}p_{i}^{s}\log_{2}(\frac{p_{i}^{s}}{p_{0}}) $$
(2)
$$ IGS(s)=En(s)-REn(s) $$
(3)

where \(p_{i}^{s}\) is the frequency of A, G, U, C in sequence s, and p0 is the uniform distribution of each nucleotide occurrence, namely p0=1/4. The frequency of each nucleotide is then combined with the entropy features as a 7-dimension feature vector.

SNP Features Singe nucleotide polymorphism is a kind of variant at specific sites in genome. For SNP sites, several possible nucleotide variations are alleles for this position. As a synonymous single nucleotide variant, SNP changes the sequence of mRNA but does not alter the amino acid sequence of protein [45]. In addition, m6A is regulated by some proteins which also have fixed RNA binding sites, which means the flanking window sequence around m6A site has specific base groups patterns. The SNP variant of mRNA sequence may disrupt the DRACH motif or protein binding regions, leading to failures of m6A dynamic regulations [44]. Hence, we attempted to find positions with unique SNP states. From the Ensembl database, we map SNP variants in the transcript and convert sample sequence into a 51-bit 0/1 vector (i.e., 0 denotes a non-SNP variant position; 1 donates an SNP variant position). As there are various methods to select effective features [46, 47], in this paper Max-Relevance Min-Redundancy (MRMR) algorithm [43] and Fisher’s exact test are adopted to recognize special SNP positions.

MRMR selects positions with a maximal statistical criterion based on mutual information. MRMR tries to find a position subset, which have maximum relevance (dependency) with class and minimum internal redundancy. MRMR adds positions into the subset one by one and the order is determined by relevance to the target class and the redundancy with the other positions. Fisher’s exact test is a statistical significance test. For an individual position, it investigates the SNP variant distribution difference between the positive and negative samples and derives a p-value to assess the difference. A low p-value means the SNP variant at this position has great difference between the negative and positive samples. Finally, we can rank positions with Fisher’s exact test p-value and the MRMR selection order. By calculating the average ranking of MRMR and Fisher’s exact test, positions with a significant SNP specificity can be identified. The SNP variant states of such specific positions are considered as SNP features. The detailed SNP specificity identification algorithm is presented in supplementary Algorithm S1.

Conventional sequence features

4-bit Binary Features Binary encoding is a common feature extraction method to characterize RNA sequences. As mRNA sequence contains four nucleotides A, C, G and U, this encoding method can map every single nucleotide into a 4-bit binary code. The mapping rules are: ’A’- (1,0,0,0), ’C’- (0,1,0,0), ’G’- (0,0,1,0), ’U’- (0,0,0,1). By this way, a 51-nt sequence can be transformed into a 204-dimension feature vector.

Chemical Property with Density (CPD) Based on differences in chemical property, four kinds of nucleotides can be categorized into different groups [48]. In terms of ring numbers in a single base group, C and U have only one ring while A and G have two. Besides, C and G have strong hydrogen bonds when forming secondary structures, whereas hydrogen bonds in A and U are both weak. When considering chemical functionality, amino group contains A and C while keto group includes G and U. Thus, we can divide the nucleotides by different chemical properties and use overlapping encoding rules: ’A’- (1,1,1), ’C’- (0,0,1), ’G’- (1,0,0), ’U’- (0,1,0). In literature work, the density of nucleotide is always used with chemical property features, which calculates the frequency of a nucleotide occurring before current position. Density feature d_i is defined as:

$$ d_{i} =\frac{1}{\left | S_{i} \right |} \sum\limits_{j=1}^{i}f(s_{j}), f(s_{j})=\left\{\begin{array}{ll} 1 & s_{j} = s_{i}\\ 0 & s_{j} \neq s_{i} \end{array}\right. $$
(4)

K-mer Features In mRNA sequence, adjacent nucleotide pairs have influence on mRNA structures and functions. K-mer is the frequency of k-nt adjacent nucleotides. As a global feature, k-mer has been proved to be effective in many sequence based site predictions. The length of k-mer feature is 4k bits. In this paper, we adopt 2-mer and 3-mer. Each sample has a 80-dimension k-mer feature vector.

Imbalance learning

Imbalance learning has been explored for protein binding site prediction [4951] and protein-protein interaction sites identification [52, 53]. However, the imbalance learning for m6A prediction has not been explored. An intuitive way to address this problem is to integrate sampling and ensemble techniques, which trains basic classifiers with different sampling data and combines the results in an ensemble way to reduce the random sampling bias. But it requires effective sampling techniques to select meaningful negative subsets and there are some researches focus on dynamic and cluster ways [54]. Another viable strategy is to introduce cost-sensitive learning models, like weighted support vector machine and cost-sensitive decision trees, using different matrices to describe the costs for classifying sample into wrong class [55].

Here we use a cost-sensitive XGBoost classifier as learning model. XGBoost (eXtreme Gradient Boosting) is a tree boosting algorithm developed by Chen [56]. It is an advanced implementation of gradient boosting algorithm, which has been widely applied for classification problems. XGBoost has some advantages over other cost-sensitive classifiers. Firstly, the regularization can effectively prevent training model from over-fitting. Secondly, embedded parallel processing allows a faster learning speed. Thirdly XGBoost is of high flexibility and allows users to define custom optimization objectives and evaluation criteria. Moreover, XGBoost classifier can learn from imbalance training data by setting class weight and taking ROC as evaluation criteria. Here we implement the model with a python package named xgboost (vision 0.6a2). The parameters can be optimized by 10-fold cross validation in the learning stage. The parameters in our model are: ’lambda’: 700, ’max-depth’: 6, ’eta’: 0.1, ’silent’: 1, ’objective’: ’binary:logistic’, ’booster’: ’gbtree’,’scale-pos-weight’: 6, ’eval-metric’: ’auc’ and training boost round is 400, while other parameters are all default values.

In this paper, our method is compared with three recently published human m6A prediction methods. These three literature methods are: SRAMP [30], Methy-RNA [31] and RAM-NPPS [32]. They all have open access web predictors and SRAMP also provides a tool package for local implementation. The prediction results of Methy-RNA and RAM-NPPS are obtained from the web predictors, while the results of SRAMP are derived from tool package in mature mode.

Performance evaluation metrics

The proposed prediction method is evaluated by 10-fold cross validations and independent test dataset with four frequently used metrics: precision, recall, F1-score and Matthews correlation coefficient(MCC). As RAM-NPPS and Methy-RNA cannot return prediction probabilities, we do not use AUROC or AUPRC as evaluation metrics.

Precision and recall reflect the tendencies of classifier prediction. Recall (also called sensitivity in binary classification) illustrates how many positive samples are rightly classed, and precision shows the ratio of true positive sample ratio in all predicted positive-label samples. There is always a trade-off between precision and recall, so we introduce F1 and MCC to evaluate the overall performance of a predictor. F1-score combining precision and recall together can assess the performance on both balanced and unbalanced test datasets. MCC is also a frequently used metric in classifier evaluation, which return a value between -1 to 1: 1 standing for perfect prediction and -1 for reversed prediction.

Results

We report the specificity results of SNP identification as new features. In the performance comparison and evaluation, we tested our HMpre method and other existing predictors on the independent test dataset. To demonstrate the robustness of our method to deal with the unknown percentages of positive samples in real transcripts, we compared our method with three existing human m6A predictors on datasets of different positive-and-negative sample ratios. To evaluate the performance for the practical use, we tested all the predictors on single transcripts. Lastly, we report the feature effectiveness results of HMpre and XGBoost classifier feature importance scores.

Specific SNP identification as new features

To identify positions with specific SNP variant states as new feature, MRMR and Fisher’s exact test are applied to analyze sequence SNP variant states in the training dataset. As presented in Fig. 2, MRMR and Fisher’s exact test give rankings to all the positions numbered from -25 to 25 in the window sequence.

Fig. 2
figure 2

SNP Specificity Ranking. The black blocks stand for the Fisher’s exact test rankings and the green blocks stand for the MRMR rankings. X-axis is the window sequence sites from -25 to 25. Y-axis is the total ranking of each position. A low ranking means a high SNP specificity at this position

In the process of selecting a position subset, MRMR defines mutual information to evaluate the subset for the inner redundancy and relevance with the target class, then it gives out the order of position selection and we take the order as position importance ranking. The top 12 positions are -2, 25, 1, -21, -1, 18, -24, 16, -11, 20, -15 and -19. Fisher’s exact test can statistically recognize the SNP variant distribution difference for these individual positions between the positive and negative samples, as described by a p-value. With Fisher’s exact test p-values (details in Table S1 of Additional file 2), we can also rank all these positions. The top 12 positions are -2, -1, 2, -25, 15, -24, -19, -4, -23, -21, 12 and -20. Finally we choose the top 12 positions with the highest average ranking as SNP features. These highly ranked positions are illustrated in Table 1. These positions have relatively higher ranking both in MRMR and Fisher’s exact test. Detailed results are listed in Table S2 of Additional file 2.

Table 1 Ranking details of Top 12 specific SNP positions (FET: Fisher’s exact test)

Performance on the independent dataset

Our proposed HMpre is compared with three existing prediction methods on the independent test dataset. The prediction results are reported in Table 2. HMpre achieves the best performance under all metrics except recall; RAM-NPPS has a better recall of 0.6339 than HMpre. The precision of HMpre is 0.3035, 0.04 higher than SRAMP which is the best in the existing predictors. Overall, HMpre achieves F1 score of 0.3961, higher than the best F1 value of the other three predictors (0.3408 by RAM-NPPS). In terms of MCC, Methy-RNA has a value of -0.1619 and SRAMP is 0.2653, about 0.08 higher than RAM-NPPS, but still lower than HMpre’s 0.3329.

Table 2 Performance on the Independent Test Dataset (Methy: Methy-RNA; NPPS: RAM-NPPS)

Robust performance when tested on datasets with different imbalance ratios

In normal situations, the numbers of m6A and non-m6A sites are unknown before prediction. Therefore, a practical m6A predictor should have a strong robustness against the imbalance level change. To appraise the robustness of HMpre and other predictors, we test them on nine datasets whose negative samples to positive samples ratios range from 1:1 to 9:1. Here we adopt the overall metrics F1 and MCC as evaluation criterions. The results are reported in Fig. 3. The F1 and MCC values of all the methods have a trend of decreasing when the imbalance level increases. The F1 scores of RAM-NPPS and Methy-RNA decrease more rapidly than HMpre and SRAMP. For the MCC values, HMpre also has a relatively slow changing rate while the other methods are comparable. Moreover, HMpre has a better performance on all of these datasets under F1 and MCC, proving that HMpre has a stronger robustness.

Fig. 3
figure 3

Performance on Datasets of Different Imbalance Levels. The F1 and MCC values of four predictors are represented. X-axis k is the ratio of the negative samples to positive samples (imbalance level) in a test dataset; Y-axis is metric value

Average performance on 1226 individual transcripts

Since the testing objects are always single transcripts in real cases, the four predictors are evaluated on individual transcripts. There are 1226 transcripts in the independent dataset for the four methods to make predictions. The imbalance levels of the 1226 transcripts are different, and we calculate the average metric values of all the transcripts as the final results for each method. The results are reported in Table 3. Although RAM-NPPS has the highest recall of 0.6582, HMpre achieves the best performance under the remaining four metrics (precision 0.2972, recall 0.6062, F1 0.3658 and MCC 0.3239). Especially, the overall metrics F1 and MCC of HMpre are about 0.07 and 0.08 higher than SRAMP, the best existing predictor.

Table 3 Performance on Individual 1226 Transcripts (Methy: Methy-RNA; NPPS: RAM-NPPS)

Feature importance analysis

Three types of new features are extracted to add to the existing feature space to improve the prediction performance. 10-fold cross validations with different feature spaces are used to verify whether the new feature space actually improves the prediction performance. The performance of the three types of traditional features and their merged features are compared with the proposed feature space in Table 4. The three types of traditional features (four-bits binary coding, chemistry property with density and k-mers) achieve distinct performance and the 4-bit binary features are better than the other two types of features. By joining the three types of conventional features together, all metrics increase comparing with individual features. The proposed feature space, combining conventional and new features together, exhibits the best performance under all metrics.

Table 4 Different Feature Space Performance in Cross Validation (CPD: Chemical Property with Density; Joint: joint of conventional features)

We also attempted to understand more about the role of each feature in prediction. XGBoost can make an inner analysis of feature importance during learning process and output scores for all the features. The importance scores can reveal how meaningful the features are when building model and tell which features plays leading roles in the feature space.

The feature importance scores boxplot is presented in Fig. 4. There are 509 features and their distribution is presented in Table S3 of Additional file 2. The importance scores have a wide range from 0 to 1064. The features with a 0 score are from 4-bit binary and CPD features, corresponding to the motif adjacent sites which are ’GAC’ or ’AAC’ in all the samples. The dimension with the highest score 1064 (f501) refers to the site distance from transcript start site, followed by features of relative location in transcript (f500, scored 943) and sequence entropy (f506, scored 342). Besides, density features in CPD features has relatively high importance scores. Detailed importance scores is shown in Figure S1. For the average score, binary and CPD are much lower than other features while site location and entropy information are obviously higher. K-mers and SNP have comparable average scores. From the results, the three types of new features are indeed significant in the feature space.

Fig. 4
figure 4

Boxplot of Feature Importance Scores

Case studies

In this section, we report two detailed case studies to understand the difference of the four predictors and evaluate their capacity in practical use. First, we describe the prediction results for the c-Jun transcript from the test dataset. The second case study is about the m6A sites in the mRNAs of CBFB gene which can modulate HIV-1 replication and infection [20].

m6A prediction for c-Jun transcript

Transcript ENST00000371222 of c-Jun gene contains 25 verified m6A sites and 47 non-m6A sites conforming to the DRACH motif. HMpre made a prediction of 21 m6A sites: 18 of them are true positives while 3 are false positives. SRAMP made 12 true positive m6A sites and 3 false positives. RAM-NPPS made 14 true positives and 12 false positives. Methy-RNA made the most 31 false positive predictions and identified only 19 true m6A sites. Thus, Methy-RNA achieved the highest true positive rate, but it made the most number of false positive predictions. See Fig. 5. Despite SRAMP achieved a good precision of predicted m6A sites, a large number of true m6A sites were wrongly classified. RAM-NPPS has more false positives and less true positive predictions than SRAMP and HMpre.

Fig. 5
figure 5

Predicted m6A sites in the case studies. The x axis stands for the potential m6A sites confirming to the sequence motif DRACH and the y axis indicates the four predictors. All colored blocks are the predicted m6A sites. Red blocks represent true positive sites and yellow blocks are false positive ones. (a) the prediction results for the c-Jun case and (b) the predictions for the HIV-1 case

Table 5 shows the detailed prediction performance. Overall, the precision, F1 and MCC of our HMpre method are much higher than the other prediction methods. Although Methy-RNA has a high recall 0.96, it has the lowest precision, F1 and MCC. The performance of SRAMP is better than RAM-NPPS, but the recall of SRAMP is the lowest 0.48, suggesting a lot of positive samples are predicted to be negative.

Table 5 Results for the c-Jun gene case study (Methy: Methy-RNA; NPPS: RAM-NPPS)

m6A site prediction for a transcript related to HIV-1 infection

The longest transcript ENST00000290858 of CBFB gene from the Ensembl database was chosen for this case study. There are 62 adenines (A) conforming to the motif in this transcript. The experimentally validated m6A sites of CBFB gene are acquired from RMBase, a online m6A database [57]. Based on these data, we constructed a test dataset of 12 positive samples and 50 negative samples.

The predicted m6A sites are presented in Fig. 5. HMpre made 5 true positive and 3 false positive predictions. SRAMP made 3 true positive and 2 false positive predictions. RAM-NPPS and Methy-RNA made more false positives than true positives: RAM-NPPS had 6 true positives and 16 false positives, and Methy-RNA had 51 false positives and 8 true positives. The predicted m6A sites by SRAMP are mainly correct, but it missed a lot of true m6A sites.

The detailed results are reported in Table 5. Methy-RNA achieves the best recall 0.6666 but the worst precision 0.1702. SRAMP has a high precision 0.7692 but the lowest recall 0.25. Our HMpre method has the best precision 0.56256 and achieves the best performance on the overall metrics F1 0.5 and MCC 0.4203.

Discussion

In this paper, we adopted a XGBoost classifier as the prediction model. On one hand, this classifier can learn from imbalance data which is similar to data in practical prediction situations and inner regularization rules can prevent model from over-fitting; on the other hand, when the scale of training data is quiet large, it would cost classifiers like SVM and Random forest much longer time than our method in training stage.

The efficiency of features is crucial to the performance of predictors. Here, we presented m6A sites with meaningful biological features instead of just using flank window sequence features. In this work, the size of flanking window are fixed to 51-nts which is the same with existing methods. The influence of sequence size on feature efficiency will be studied in next stage of research. In addition, some m6A biological characteristics found recently can be taken as new features in the prediction and we will try them in the future.

Conclusion

To address the problem of class imbalance in the training data for human mRNA m6A prediction, we have proposed a novel computational method called HMpre. The key idea is a cost-sensitive learning model. Three types of new features are also introduced to learn more from the imbalanced training data for the further improvement of the prediction performance. Along with other three existing methods, HMpre was tested on an independent dataset. The results show that our method has better correctness and robustness. The feature importance analysis demonstrates that the new features are exactly meaningful in the prediction. In the detailed cases studies, our method also outperforms over the existing predictors. Class imbalance is a long-neglected but important issue in the m6A prediction problem. Imbalance learning provides a promising way to resolve this issue.