Introduction

In recent years, the field of RNA modification has gained significant prominence, with its origins dating back to the groundbreaking proposal in 2008. To date, over 300 distinct RNA modifications have been identified, with more than 170 of them undergoing extensive investigation [1, 2]. These modifications exert profound influences on RNA molecules, impacting their structural conformation, stability, and functional attributes. Notably, the N6-methyladenosine (m6A) modification has garnered substantial attention due to its prevalence within mRNA and its pivotal role in various biological pathways.

m6A modification stands out as the most prevalent and comprehensively studied post-transcriptional modification within mRNA, almost across the entire transcriptome. This prevalence is particularly pronounced in higher eukaryotic organisms. The significance of m6A modification extends to its far-reaching impact on diverse aspects of biological development, spanning hematopoietic development [3], reproductive processes [4], central nervous system functioning [5] and the regulation of cancer pathways [6]. Consequently, the precise identification of m6A methylation sites has become increasingly imperative in the realm of biological research.

Currently, researchers mainly employ two high-throughput sequencing techniques to profile the m6A sites: MeRIP-seq and miCLIP-seq. MeRIP-seq capitalizes on m6A-specific antibodies to immune-precipitate small RNA fragments following splicing and reverse transcription. Subsequently, the cDNA fragments are sequenced, unveiling the location and extent of m6A enrichment [7]. Besides MeRIP-seq, miCLIP-seq utilizes UV light-induced cross-linking to introduce specific mutational features, which enables the precise identification of the m6A residues in RNA molecules [8]. Nonetheless, the reliability of both methods has been tested extensively, revealing susceptibility to multiple influencing factors, leading to occasional inaccuracies and instability. These factors include antibody specificity, domain fusion, and various statistical approaches aimed at mitigating technical noise [9,10,11]. Beyond these technical concerns, the considerable labor, material, and time costs further intensify the challenges faced by researchers.

To address these challenges, computational databases [12,13,14,15,16,17] and in silicon efforts [18,19,20,21,22,23,24,25] focusing on various biological domains have been developed. For m6A RNA methylation, tools like SRAMP [26] and iRNA toolkits [27,28,29] were established, drawing on a variety of sequence-derived data. The WHISTLE [30], emerged as a high-precision predictor, pioneering the integration of domain knowledge and genomic features into m6A prediction frameworks. More recently, deep learning-based methodologies have also demonstrated their prowess in m6A detection [31,32,33].While these advancements have significantly enhanced in silico identification of modified residues. To date, no research has been made to predict conserved m6A sites across multiple human tissues, despite their established biological significance [34,35,36]. The ConsRM [37] firstly quantified the conservation degree of base-resolution m6A sites between human and mouse transcriptomes. The evolutionary conservation in influenza was researched through potential m6A sites based on DRACH motif [38].

The identification of tissue-conserved m6A sites assumes paramount importance due to their resistance to interference from extraneous factors and their inherent stability, rendering them invaluable indicators for quantifying m6A expression levels. Moreover, most existing prediction tools are predominantly based on sequence information and do not incorporate annotations regarding potential post-transcriptional regulatory functions, which are instrumental in elucidating functional consequences. Therefore, the efficacy of these predictors remains subject to limitations.

Here, we present m6A-TCPred, a web server to predict tissue-conserved m6A sites in human. By learning and testing the m6A datasets identified from 23 human tissues, the newly integrated framework m6A-TCPred (see Fig. 1) provides a high-accuracy mapping of tissue-conserved human m6A sites (an average value of AUROC 0.879). Additionally, our results have been integrated into an online platform: a database to hold 268,115 high confidence m6A sites with their conserved information across 23 human tissues; and a web server to predict the conserved status of user-uploaded m6A datasets. The m6A-TCPred is freely accessible at: www.rnamd.org/m6ATCPred.

Fig. 1
figure 1

The framework of m6A-TCPred. The entire datasets from 23 human tissue were further classified into conserved and normal datasets. The model constructed on the genomic and sequence features were trained and evaluated through cross validation and independent testing. The model was further illustrated by interpretation and gene ontology enrichment

Materials and methods

Training and testing dataset

m6A-TCPred was proposed to predict the tissue-conserved m6A methylation sites. The entire datasets were all high-confidence experimentally validated m6A sites (identified in at least two independent studies) collected from m6A-Atlas database [39]. To comprehensively expand the prediction range, a total of 268,115 base-resolution m6A sites were first filtered from m6A-Atlas (with record time > 2). Next, by checking with m6A-containing regions identified from 23 different tissue contexts, we defined the m6A sites simultaneously appeared in all tissue contexts as tissue-conserved m6A sites (dataset P), while the m6A sites appeared in less than 30% of tissue contexts as tissue-specific m6A sites (dataset N). Importantly, the m6A sites that did not appear in any tissue context were excluded to avoid potential bias. Totally, the dataset P contains 10,424 m6A sites, while dataset N includes 54,949 m6A sites across 23 human tissues. To maximize the use of dataset P, the dataset N was further split into 10 sub-datasets and then developed into 10 models in 1:1 Positive-to-Negative ratio with positive datasets to achieve average performance. For performance evaluation, 80% of the dataset was randomly selected as training data, while the rest of 20% was used for independent testing. Please refer to Additional file 1 for the detailed dataset collected to develop the prediction framework.

Feature extraction

Sequence-derived features. Encoding approaches based on sequence information have been broadly applied and achieved good performance in prediction [40,41,42]. Our new model also adopted the encoding strategy consisting of two parts: Chemical Properties of nucleotides (NCP) and Electron–Ion Interaction Pseudopotential (EIIP).

The first encoding was originally employed in the prediction of DNA sequence splicing site and its efficiency has been confirmed in RNA modification prediction [43, 44]. It depends on the structural differences of chemical properties from ring structure, functional groups and hydrogen bonds. Specifically, one ring structure exists between cytosine and uracil, as well as two ring structures between adenine and guanine. Adenine and cytosine have amino groups, while ketone group is carried between guanine and uracil. Guanine and cytosine, connected by three hydrogen bonds, have stronger binding ability than adenine and cytosine with two hydrogen bonds. As a result of these three structural chemical properties, the encoding of i-th nucleotide of given sequence S will be conducted as Vector Si = (Xi,Yi,Zi):

$$X_{i} = \left\{ {\begin{array}{*{20}c} 1 & {if\;S_{i} \in \{ A,G\} } \\ 0 & {if\;S_{i} \in \{ C,U\} } \\ \end{array} } \right.,\quad Y_{i} = \left\{ {\begin{array}{*{20}c} 1 & {if\;S_{i} \in \{ A,C\} } \\ 0 & {if\;S_{i} \in \{ G,U\} } \\ \end{array} } \right.,\quad Z_{i} = \left\{ {\begin{array}{*{20}c} 1 & {if\;S_{i} \in \{ A,U\} } \\ 0 & {if\;S_{i} \in \{ C,G\} } \\ \end{array} } \right.$$
(1)

Therefore, A, C, G and U can be encoded by vectors (1,1,1), (0,1,0), (1,0,0) and (0,0,1) respectively.

The Electron–Ion Interaction Pseudopotential (EIIP) was calculated from the delocalized electrons energy from amino acids and nucleotides [45]. This encoding strategy was originally used in the DNA sequences to locate exons and gradually promoted to RNA sequences field [46]. In EIIP method, each nucleotide in RNA sequence was standing for a numeric value to represents its EIIP energy. Specifically, the EIIP for nucleotide A, G, C, U is 0.1260, 0.0806, 0.1340 and 0.1335, respectively.

Genome-derived features. To capture the distinct characteristics of tissue-conserved m6A sites across human tissues, we extracted 54 additional genomic features from m6ALogisticModel package (generation R code in Additional file 5). These features were selected to accurately represent the topological attributes of the modified residues.1–15 are dummy variable features that indicate whether the tissue-conserved m6A sites are located within specific transcriptome regions with unique topological properties. All genomic features were generated using the R/Bioconductor package and the hg19 TxDb transcriptome annotation package [47]. In addition, only the primary (longest) transcript of each gene was kept to avoid ambiguities arising from transcript isoforms and extract transcriptional sub-regions [48]. Features 16–19 provide actual valued information on the relative transcript region position, including 3’UTR, 5’UTR, and the whole transcriptome. Sites falling outside these regions are assigned a value of zero. Features 20–27 detail the length of the transcript region containing the modification site, with a value of zero for sites not belonging to the region. Features 28–31 describe the distance between the adenine site and the splicing junction 5’end or 3’end. The distance to the closest tissue-conserved m6A site in the training data is calculated to quantify the clustering effect of conserved m6A sites. Features 32 and 33 provide the evolutionary conservation score of the conserved adenosine site and its flanking regions, calculated using Phast-Cons score [49] to assess the conservation level of potential nucleotide sequences. The RNA structure surrounding the conserved adenine site is included in features 34–35, predicted using the RNAfold Vienna RNA package [50]. Features 36–43 illustrate the structural function of m6A regulatory binding complexes, including readers, writers, and erasers. Features 44–48 encompass genomic properties of genes or transcripts containing conserved m6A sites. Features 49–51 indicate the z-score of GC content, while features 52–54 provide omics information, such as microRNA target sites. Detailed information of each genomic feature can be found in Table S1 in Additional file 3.

Machine learning approach used for prediction of tissue-conserved m6A site

The Support Vector Machine (SVM) approach is a data-driven machine learning algorithm widely utilized for classification tasks. Notably, SVM outperforms artificial neural networks in scenarios involving extensive genomic data, yielding lower error rates [51]. This technique has been previously applied to identify biomarkers from gene expression data, explore protein interactions [52], pinpoint therapeutic cancer targets, and achieve genome-wide recognition using diverse high-throughput datasets [53].In this research, the prediction model was constructed on R studio interface of LIBSVM [54], with radial basis function serving as kernel. The other parameters were employed by default.

Performance evaluation of tissue-conserved m6A site prediction

To comprehensively evaluate the performance of our predictor, the SVM classifier was examined through fivefold cross-validation on training datasets and tested from 10 independent testing datasets.

To comprehensively evaluate the prediction model performance, five metrics were introduced as the indicators to examine the reliability. Receiver Operating Characteristic (ROC) curve (sensitivity against 1-specificity) and Area Under ROC Curve (AUROC) are the primary performance evaluation indicators. The other four metrics, Sensitivity (Sn), Specificity (Sp), Matthew’s Correlation Coefficient (MCC) and Overall Accuracy (ACC) were also employed to quantify and test the model reliability. The entire process was conducted in R language.

$$S_{n} = \frac{TP}{{TP + FN}}$$
(2)
$$S_{p} = \frac{TN}{{TN + FP}}$$
(3)
$$MCC = \frac{TP \times TN - FP \times FN}{{\sqrt {(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)} }}$$
(4)
$$ACC = \frac{TP + TN}{{TP + TN + FP + FN}}$$
(5)

where the TP, TN, FP and FN respectively represent true positive; true negative; false positive and false negative.

Website construction

The predictor website platform is based on Hyper Text Markup Language (HTML), Cascading Style Sheets (CSS) and Hypertext Preprocessor (PHP), as well as the MySQL tables for metadata storage.

Results

Determine the best machine learning sequence strategy and algorithm for tissue-conserved m6A site prediction

Although the sequence encoding strategy has already achieved a little reliability in prediction. In order to achieve the best classifier to construct the m6A-TCPred, we combined it with genomic features and tested its performances under different sequence encoding strategies and classifiers on independent dataset (see Table 1). The highest AUROC score is 0.669 when the encoding strategy adopts NCP + EIIP under SVM classifier, representing the best performance among all sequence encoding and algorithm.

Table 1 Performance evaluation of tissue-conserved m6A sites of using different sequence strategy and algorithms

Performance evaluation of tissue-conserved m6A predictor by benchmark and independent testing

All the features are normalized and converted into numerical matrix between 0 and 1. The final tissue-conserved m6A prediction model was constructed based on combination of sequence and genomic features.

To comprehensively evaluate the model, ten independent datasets were randomly extracted from negative datasets and integrated with the positive data for average performance evaluation. The prediction model achieved the average value of AUROC 0.879, as well as AUROC 0.871 of fivefold cross-validation (Table 2), suggesting reliability in model distinguishment.

Table 2 Prediction performance using cross validation and independent test dataset

Feature ranking and functional characterization of tissue-conserved m6A sites

The feature ranking illustrates the performance efficiency of all features when model processing, which points out the contribution of various features to identifying tissue-conserved m6A sites. The Fig. 2A listed the top 15 most effective features in predicting tissue-conserved m6A sites, from which we can observed that exon regions may have strong associations in distinguishing tissue-conserved m6A sites from non-conserved ones, especially for long exon regions (> 400 bp). Additionally, feature selection was performed. When using the top 24 genomic features, the model exhibited the best performance with an AUROC of 0.89. As more features were added, the model performance slightly decreased, ranging between AUROC of 0.87 and 0.88 (Figure S1 in Additional file 4). Taken together, this result indicated that feature overfitting has only very limited impact on model performance.

We further examined the biological characterization of the tissue-conserved m6A sites. Specifically, over 260,000 m6A sites were extracted and more than 10,000 of them identified as conserved m6A sites in human tissue for examining their putative functional relevance with Gene Ontology (GO) Enrichment Analysis. In Fig. 2B, we present the top five enriched items in biological pathways, cellular composition, and molecular functions, respectively. Many of these enriched items have previously been confirmed to have strong associations with m6A methylation. Among them, most of them have been studied in several research. For histone modification, the consumption of H3K36me3 has been proved to decrease the abundance of m6A sites [57]. For transcription factor, the m6A-mediated regulation of JUN and JUNB TFs are critical in gene regulation network [58]. A more comprehensive list of the results from the Gene Ontology analysis is available in Additional file 2 (Fig. 2).

Fig. 2
figure 2

Model interpretation. A The top 15 of most contributing features. Features are output based on machine learning model recognition capabilities. B Gene enrichment analysis of the tissue-conserved m6A sites. BP is the biological process; CC is cell component and MF represents molecular function. The gene identification is obtained by using R package RMAnno. Gene ontology analysis is conducted by R package ClusterProfiler [59]

Additionally, to further explore the molecular features of tissue-conserved m6A sites from normal ones, we conducted the motif analysis for both of them. The result (Figure S2 in Additional file 4) showed that the sequences of tissue-conserved m6A sites and normal m6A sites follow the pattern of DRACH motif. Meanwhile, no significant differences were observed between tissue-conserved and non-conserved m6A sites, which was consistent with our finding that sequence-based information alone cannot effectively used for classification.

Web server implementation

To enhance the practicality and accessibility of our prediction model, we have developed a user-friendly web server, which is accessible at http://www.rnamd.org/m6ATCPred. Figure 3A exhibits all m6A site datasets incorporated in our model. The selection menus provide users with the flexibility to filter conserved, non-conserved sites, or examine the tissue counts for each site. Detailed information for each site is easily accessible by clicking on the respective ID (see Fig. 3B). Figure 3C illustrates the functionality of our web server, it allows users to upload their m6A coordinates, while the possibility of input data is evaluated, and each related site is classified into conserved/non-conserved. The free download function is also available.

Fig. 3
figure 3

The database and web server of m6A-TCPred. A The database exhibited the information of all m6A sites. The user can select conserved, non-conserved sites and tissue counts or search according to their own needs. B The detailed information of one m6A site. C The web server allows users to submit their own m6A coordinates. The m6A-TCPred will predict conservative probability of each site

Conclusion

m6A methylation stands as one of the most pivotal RNA modifications, with its substantial abundance in mRNA and integral role in biological processes garnering significant attention. Despite this, the precise identification and localization of conserved m6A sites in human tissues has remained a largely unexplored domain, primarily due to limitations associated with experimental methodologies. In response to these challenges, our research has developed m6A-TCPred, a computational predictor that integrates sequence feature information and genomic features. In contrast to existing predictors, our model excels in the accurate identification of tissue-conserved m6A sites and their discrimination from non-conserved sites across multiple human tissues. Meanwhile, regrading whether tissue-conserved sites are influenced by housekeeping gene, we evaluated the model performance by adjusting datasets that does not contain housekeeping gene. Our result (Table S2 in Additional file 3) showed there is no significant difference compared with original AUROC. Therefore, the m6A-TCPred didn’t have any bias on housekeeping gene.

Our research findings are exciting and the achievements are mainly concentrated on three aspects. Firstly, the m6A-TCPred is a high-accuracy predictor, demonstrating the high efficiency in predicting tissue-conserved m6A sites. Through fivefold cross-validation and independent testing, our model achieves an impressive AUROC score of 0.879, surpassing the previous limitations associated with sequence encoding. To ensure the broad accessibility of our research, we have integrated the entire model into a user-friendly website. This resource is open to all, enabling individuals to submit genome coordinate files and use our predictor for tissue-conserved m6A site predictions. We anticipate that this tool will serve as a powerful resource for researchers delving into the intricacies of conserved m6A sites in human tissue. Secondly, the model ranking and GO analysis provides biological characterizations of tissue-conserved m6A sites. It identified the potential functional regions with high probability of conserved m6A sites. The GO analysis provides a connection between conserved m6A sites and biological functions and some of the content has been mentioned in relevant research. Thirdly, the whole dataset and its relevant annotations were integrated into a website, which is the first collection about tissue-conserved m6A sites.

It is essential to acknowledge certain limitations. The presence of bias in our training datasets, stemming from inherent limitations in experimental techniques, may influence the model's performance. Further, sample size constraints might not capture the full spectrum of potential outcomes. As we continue our research, we remain committed to improving the model's reliability with the incorporation of the latest sequencing data. Additionally, we recognize that the accuracy of our predictions is contingent on the availability of additional information, such as RNA secondary structure, free energy, RNA type, and more.