Background

Technological advancements have revolutionized our understanding of RNA modifications, which can occur by removal (by deamination, often called “RNA editing”) or by the addition of chemical side groups on the ribose or base moieties. These chemical entities, collectively known as the “epitranscriptome” [1], occur not only in tRNAs and rRNAs, where they were first identified and have traditionally been studied, but also in other molecules, such as mRNAs, long noncoding RNAs, piRNAs, and miRNAs [2,3,4,5,6]. A number of studies have shown that RNA modifications can profoundly affect central biological processes, including cell fate [7], sex determination [8, 9], maternal-to-zygotic transition [10], and the circadian clock [11], as well as plant developmental timing, morphogenesis, and flowering [12]. Furthermore, dysregulation of their activity has been associated with more than 100 different human diseases [13,14,15,16,17]. At a molecular level, modifications can affect the fate and function of the RNA molecules that contain them, including turnover rates [18,19,20], translation efficiency [21, 22], and subcellular localization [23], among others.

Over 170 different RNA modifications are known to decorate RNA molecules [24]. In the last few years, considerable effort has been devoted to functionally dissecting the biological role of N6-methyladenosine (m6A), the most prevalent internal RNA modification found in human mRNAs. m6A is placed by a multicomponent transferase complex, in which methyltransferase 3 (METTL3) acts as the catalytic subunit [25, 26]. Moreover, m6A modifications can be reversed by the activity of m6A demethylases or “erasers,” namely the fat mass and obesity-associated protein (FTO) [27] and the alkB homolog 5 (ALKBH5), although recent studies have suggested that only the latter can demethylate m6A marks [28]. Mechanistically, m6A modifications can alter mRNA splicing [29,30,31], cause mRNA decay [20], and affect translation [2, 32, 33]; thus, they can govern major cellular processes including cellular fate [34, 35], stress responses [2], and differentiation programs [36]. These features have positioned m6A marks, and more specifically, their “readers,” “writers,” and “erasers,” as promising drug targets for multiple diseases, including cancer [37,38,39,40]. However, the functional characterization of the majority of RNA modifications remains uncharted territory.

Insights into the physiological roles of specific RNA modification-related proteins (RMPs) have mostly come from naturally occurring phenotypes or diseases associated with their loss of function [13,14,15,16,17]. However, a systematic annotation and characterization of RMPs across human tissues, cell types, and disease states is currently lacking.

Here, we have compiled and analyzed the evolutionary history of 90 RNA modification writers as well as the gene expression patterns of 146 human RMPs (Additional file 1: Table S1) from 32 tissues, 10 species, and 13,358 tumor-normal samples. Our analyses revealed that many RMPs display restricted gene expression patterns and/or are dysregulated in specific types of cancer. Specifically, we found that a vast proportion of RNA modification “writers” have undergone duplications (84%) and that these were typically accompanied by a change in their RNA target specificity and/or tissue expression patterns (82%). We observed that the most frequent change in tissue specificity is the acquisition of restricted testis-specific expression, suggesting that a significant portion of the human RNA modification machinery is likely devoted to sperm formation and maturation. We also found that 27% of human RMPs are significantly dysregulated in cancers, and identified several dysregulated RMPs whose expression is strongly correlated with cancer prognosis. Overall, our work reveals an unanticipated heterogeneity of RMP expression across both normal and malignant cell types, and points towards several less-characterized RMPs, such as HENMT1 or LAGE3, as promising drug targets for antitumor therapies.

Results

Comprehensive annotation and evolutionary analysis of RNA modification writers

To reveal the evolutionary history of the RNA modification machinery, we first compiled and manually curated a list of human RMPs (Additional file 1: Table S1, see also “Methods”). Due to the wide chemical variety of RNA modifications, we restricted our evolutionary analysis to the catalytic domain of three major RNA modification “writer” (RMW) classes: (i) methyltransferases, (ii) pseudouridylases, and (iii) deaminases. For each annotated RMW [13, 41, 42], Pfam domains of the catalytic domain were extracted and used as input for HMM-based searches against the human proteome. This resulted in a total of 90 human RMWs, doubling the number of annotated human RMWs in other resources [41]. To determine the evolutionary history of each family, orthologous proteins from representative species were retrieved (see “Methods”), and phylogenetic trees were built to identify the number of duplications occurring within each family. Overall, our analysis identified 46 duplication events (Fig. 1a), which have mainly occurred at the base of Eukaryota, Metazoa, and Vertebrata (Fig. 1b, see also Additional file 2: Table S2).

Fig. 1

Evolutionary analysis of RNA modification “writers.” a Detailed overview of the evolutionary history of RMW duplications during eukaryotic evolution. Red stars indicate proteins that do not target RNAs but belong to the same family as an RNA writer protein. Red lines indicate the evolutionary group in which the enzyme has appeared. b Histogram of RMW duplication events throughout eukaryotic evolution. Duplication events were inferred using multiple sequence alignments, coupled to maximum likelihood tree generation, for each family. c, d Maximum likelihood phylogenetic trees of methyltransferase family METTL2A/2B/6/8 (c) and TRMT10A/B/C (d). Cyan squares indicate the node where the duplication occurred. Numbers shown on the branches indicate bootstrap values

We find that duplications are often accompanied by changes in substrate specificity (Fig. 1c, d), at least in those RMWs where the substrate specificity has been reported. One such case is the family of 3-methylcytosine (m3C) RNA methyltransferases, where the ancestral protein methyltransferase-like protein 2 (METTL2) modifies both tRNAArg and tRNAThr, whereas its paralog enzymes, METTL6 and METTL8, methylate tRNASer and mRNA, respectively [43] (Fig. 1c). Similarly, we find that the N1-methylguanosine (m1G) methyltransferases TRMT10A and TRMT10B modify tRNAs in position m1G9 [44], whereas their paralog TRMT10C has been reported to place N1-methyladenosine (m1A) in mitochondrial tRNAs and mRNAs [3], in addition to m1G in tRNAs (Fig. 1d).

Heterogeneity of expression patterns among duplicated RMPs is conserved across species

We then wondered whether duplicated RMPs might have acquired tissue expression patterns distinct from those of the ancestral gene. To test this, we examined the heterogeneity of RMP expression patterns across tissues in human and mouse, using publicly available RNAseq datasets [45,46,47] (Fig. 2a, see also Additional file 13: Figure S1A for gene-labeled heatmaps). For each gene and tissue, we computed “tissue specificity (TS) scores” [48], which are defined as the deviation of gene expression levels in a given tissue relative to the average expression across all tissues (see “Methods”). Using this approach, we found that testis is the most distinctive tissue in terms of RMP gene expression patterns, both in human and mouse (Fig. 2a, b). This was due to several RMPs being quasi-exclusively expressed in testis (e.g., ADAD1, ADAD2), but also to several RMPs whose expression levels are significantly increased in this tissue (e.g., FBLL1, HENMT1, NSUN7). In contrast, other tissues such as colon displayed few or no tissue-enriched RMPs (Fig. 2b, see also Additional file 13: Figure S1B). Moreover, we found that RMP tissue expression patterns were largely conserved between mouse and human (Fig. 2c, see also Additional file 3: Table S3).
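
To make the idea of a tissue specificity score concrete, the following is a minimal sketch that assumes the simplest reading of the definition above: the score of a gene in a tissue is its log-scale expression minus its mean log-scale expression across all tissues. The exact formula is given in “Methods” and in [48] and may differ (for example, by using a fit across all genes); the function name, the data layout, and the toy values are hypothetical.

```python
from statistics import mean


def tissue_specificity_scores(log_expr):
    """Return {gene: {tissue: deviation from the gene's mean across tissues}}.

    log_expr is a hypothetical layout: {gene: {tissue: log-scale expression}}.
    The plain difference used here is an assumption, not the published formula.
    """
    scores = {}
    for gene, by_tissue in log_expr.items():
        gene_mean = mean(by_tissue.values())
        scores[gene] = {tissue: value - gene_mean
                        for tissue, value in by_tissue.items()}
    return scores


# Toy values for illustration only; not taken from the datasets in the text.
toy = {
    "GENE_A": {"testis": 6.0, "colon": 1.0, "brain": 2.0},  # testis-enriched
    "GENE_B": {"testis": 3.0, "colon": 3.0, "brain": 3.0},  # uniform
}
ts = tissue_specificity_scores(toy)
print(ts["GENE_A"]["testis"], ts["GENE_B"]["testis"])
```

Under this reading, a gene expressed mainly in one tissue receives a large positive score there and negative scores elsewhere, whereas a uniformly expressed gene scores zero everywhere.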

Fig. 2

Analysis of RMP tissue specificity in different species. a Heatmap of z-scaled log(TPM) values of catalytic RNA writer proteins (M: methyltransferases; D: deaminases; P: pseudouridylases) throughout human and mouse tissues. In both human and mouse, testis has the most distinct RMP expression pattern, in which many genes show very high expression, whereas other tissues such as colon show moderate expression levels of RMPs. b Scatter plots depicting tissue specificity analysis, which have been computed by representing the RMP mRNA expression values in a given tissue (y-axis) relative to the mean mRNA abundance in all tissues (x-axis). Scatter plots show that testis has a significant number of tissue-specific genes in both human and mouse, while colon shows no tissue-specific genes in human and only one in mouse. Tissue-specific genes are labeled in red. c Venn diagram of the conservation of tissue specificity between human and mouse. Out of 26 common tissue-specific genes, 16 are specifically expressed in the same tissue. d Principal component analysis of amniote tissues based on the log(RPKM) mRNA expression of their RMPs. The loadings plot (left) shows the contribution of each RMP to the clustering of amniote tissues. The score plot (right) shows the clustering of each tissue, where testis tissue (in red) is the main contributor to the variance of the data, and is found apart from the rest of the amniote tissues for every given species. e Schematic representation of the fate of the 46 RMW duplication events shown in Fig. 1, showing that 89% of them underwent a change in their tissue and/or target specificity

To validate the tissue-specific RMP expression patterns, we performed quantitative real-time PCR (qRT-PCR) in four mouse tissues (brain, liver, lung, and testis), finding similar expression patterns to those observed in the RNAseq datasets (Additional file 13: Figure S2). We then examined whether tissue-specific RMP expression patterns would also be observed at the proteomic level, finding that testis tissue showed the most distinctive RMP protein expression levels and patterns among the 17 tissues analyzed [49], whereas other tissues, such as colon, displayed few or no tissue-enriched RMPs (Additional file 13: Figure S3), in agreement with the transcriptomic analysis.

We finally extended our analysis to additional amniote species, finding that testis was also the main outlier in terms of RMP expression patterns in all species analyzed, supporting the notion that testis-specific RMP functionalities are evolutionarily conserved (Fig. 2d, see also Additional file 13: Figure S4). Overall, we found that 89% of RMW duplication events were followed by a change in tissue specificity (32.6%), target specificity (17.4%), or both (39.1%) (Fig. 2e, see also Additional file 4: Table S4 and Additional file 13: Figure S5), with a major over-representation of acquisition of testis-specific gene expression.

Testis-specific RMPs are mainly expressed during meiotic stages of spermatogenesis

The process of sperm formation, termed spermatogenesis (Fig. 3a), is a highly specialized differentiation process in which transcriptional, post-transcriptional, and translational regulation are highly orchestrated [50,51,52,53]. RNA modifications can influence pre-mRNA splicing, mRNA export, turnover, and translation, which are controlled in the male germline to ensure coordinated gene expression [35]. Recent works have shown that m6A depletion in mice dysregulates translation of transcripts that are required for spermatogonial proliferation and differentiation [34]. Moreover, m5C modifications have been shown to be essential for transmission of diet-induced epigenetic information across generations in the epididymis [54]. However, whether additional RNA modifications may be involved in such orchestration is largely unknown.

Fig. 3

Analysis of RMP gene expression during spermatogenesis. a Schematic representation of the four main phases of spermatogenesis: (i) mitotic division of spermatogonia (SPG) into primary spermatocytes (PSC), (ii) meiotic division of PSCs into secondary spermatocytes (SC), (iii) meiotic division of SCs into round spermatids (RST), and (iv) spermiogenesis, in which round spermatids (RST) mature into elongated spermatids (EST). b Heatmap of RMP expression levels in mouse testis. RMPs were clustered into 4 groups based on k-means analysis of their normalized average mRNA expression values. c Violin plots of the expression patterns of each of the 4 identified clusters. d RNA median expression barplot and immunohistochemistry of NSUN7, NSUN2, and METTL14, depicting distinct protein expression levels along the different sections of the testis and epididymis, as well as different subcellular localizations. Brown color indicates a specific staining of the antibody whereas blue represents hematoxylin counterstain

To identify the stages of sperm formation and maturation in which testis-specific RMPs are involved, we gathered publicly available single-cell RNA sequencing data from mouse testis [55] (Fig. 3b, see also Additional file 13: Figure S6A for gene-labeled heatmap). We first classified RMPs based on their gene expression patterns (see “Methods”), identifying four main expression patterns: (i) high expression only during mitotic stages (spermatogonia), (ii) high expression in both mitotic and meiotic stages (spermatogonia, spermatocytes, spermatids), although decreased in the latter, (iii) low expression throughout spermatogenesis, and (iv) high expression only during meiotic stages (spermatocytes and spermatids) (Fig. 3b, c, see also Additional file 13: Figure S6B,C).
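
The classification step can be illustrated with a small k-means sketch. The caption of Fig. 3 states that RMPs were grouped into four clusters by k-means on normalized average expression; everything else here is an assumption made for illustration: per-gene z-scoring as the normalization, two clusters instead of four, fixed starting centroids, and made-up expression profiles over three stages (spermatogonia, spermatocytes, spermatids).

```python
import math


def zscore(profile):
    """Scale one gene's profile to mean 0 and SD 1 (assumed normalization)."""
    m = sum(profile) / len(profile)
    sd = math.sqrt(sum((x - m) ** 2 for x in profile) / len(profile))
    return [(x - m) / sd if sd > 0 else 0.0 for x in profile]


def kmeans(points, centroids, n_iter=50):
    """Plain Lloyd's algorithm; `centroids` are the starting centers."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])
        if updated == centroids:  # converged
            break
        centroids = updated
    return labels, centroids


# Hypothetical profiles: [spermatogonia, spermatocytes, spermatids].
profiles = {
    "early_1": [9.0, 3.0, 1.0],
    "early_2": [8.0, 2.0, 1.0],
    "late_1": [1.0, 6.0, 9.0],
    "late_2": [0.0, 5.0, 8.0],
}
points = [zscore(p) for p in profiles.values()]
labels, _ = kmeans(points, centroids=[points[0], points[2]])
clusters = dict(zip(profiles, labels))
print(clusters)
```

Normalizing each gene before clustering means that genes are grouped by the shape of their expression trajectory rather than by absolute abundance, which is what separates “mitotic only” from “meiotic only” patterns.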

We find that the majority of RMPs, including those involved in placing, reading, and removing m6A (VIRMA, YTHDC2, YTHDF2, ALKBH5, METTL14, METTL3), are highly expressed in spermatogonial cells, whereas their expression rapidly drops as the spermatogenic process begins (Fig. 3b, c, see also Additional file 13: Figure S6C). Interestingly, this is not the case for all RMPs: the m5C methyltransferase NSUN7, for example, is not expressed in early stages of spermatogenesis, but its expression levels are drastically increased in spermatocytes and spermatids (Additional file 13: Figure S6A,C). Similarly, the testis-specific adenosine deaminase ADAD1 is not expressed in early stages of spermatogenesis, but its expression levels are greatly increased in meiotic stages. Depletion of NSUN7 or ADAD1 is known to cause infertility [56, 57], suggesting that RMPs that are selectively expressed in meiotic stages of spermatogenesis are essential for proper sperm formation and/or maturation. However, the molecular mechanisms behind these infertility phenotypes are largely uncharacterized. Similar expression patterns were observed when analyzing other publicly available single-cell mouse spermatogenesis RNAseq datasets [58, 59] (Additional file 13: Figures S7, S8, and S9).

We then investigated whether specific RMPs also showed increased expression patterns in epididymis, relative to other tissues (Additional file 13: Figure S6D). Our analysis identified two RMPs as epididymis-enriched: (i) TRDMT1—also known as DNMT2—a 5-methylcytosine (m5C) methyltransferase modifying position 38 in specific tRNAs [60], and (ii) METTL1, a N7-methylguanosine (m7G) tRNA methyltransferase, which has been recently shown to act not only on tRNAs, but also on miRNAs, promoting their maturation [61]. Interestingly, TRDMT1 has been shown to play a major role in the transmission of paternal epigenetic information across generations [54]; however, whether METTL1 is involved in the transmission of such information is yet to be determined.

Immunohistochemistry reveals heterogeneity in RMP expression patterns along the epididymis

It is well established that mRNA levels do not always correlate well with protein levels [62]. Thus, to assess whether our findings would hold at the protein level, we performed immunohistochemistry in both testis and epididymis to characterize the expression patterns of 4 RMPs at the protein level: (i) NSUN7, a putative m5C methyltransferase that has been shown to affect sperm motility [57, 63]; (ii) NSUN2, an m5C tRNA methyltransferase involved in sperm differentiation [64]; (iii) METTL14, a component of the m6A methyltransferase complex, which has been shown to be dynamically regulated during spermatogenesis [34]; and (iv) HENMT1, a piRNA 2′-O-methyltransferase responsible for transposon silencing during spermatogenesis [5] (Fig. 3d, see also Additional file 13: Figure S6E).

We found that NSUN7 is most highly expressed in spermatocytes, as well as in the initial segment and caput regions of the epididymis, in agreement with its role in the acquisition of sperm motility [57, 63, 65, 66] (Fig. 3d, left panels). Intriguingly, NSUN7 displayed vesicular-like localization in the epithelial cells of epididymal ducts, with significant accumulation in the apical surface. It is yet to be determined how NSUN7 depletion causes defects in sperm motility, as well as what the targets of NSUN7 are in testis and epididymal tissues. On the other hand, NSUN2 displayed high expression levels in spermatocytes and spermatids (Fig. 3d, middle panels). We also observed that NSUN2 is highly expressed in the initial segment of the epididymis, with decreased expression in the remaining epididymal sections. To identify the subcellular localization of NSUN7 and NSUN2, we performed immunofluorescence assays in mouse testis, co-staining with either fibrillarin (FBL, nucleolar marker) or DDX4 (chromatoid body marker [67]) (Additional file 13: Figure S29). We observed that NSUN2 was mainly expressed in the adluminal compartment in later stages of spermatogenic maturation in seminiferous tubules. Surprisingly, we found that the expression of NSUN2 and DDX4 was almost mutually exclusive, with DDX4 expressed in earlier stages of spermatogenesis and NSUN2 expressed in later stages. We did not observe colocalization of NSUN2 with DDX4, in contrast to previous reports [64].

METTL14 was also found to be highly expressed in early spermatogenesis and downregulated during the subsequent stages at the mRNA level (Fig. 3d, right panels), in agreement with the dynamic regulation of m6A levels during spermatogenesis [34]. This result was corroborated at the protein level using IHC, where METTL14-positive early spermatogenic cells are found in the periphery of the seminiferous tubules, while round spermatids and elongated spermatids, located in the very interior of the seminiferous tubules, and spermatozoa, found in the lumen of the seminiferous tubules and epididymis (see Additional file 13: Figure S10), were negative. Finally, HENMT1 was most highly expressed in spermatogonia and secondary spermatocytes at the RNA level; however, IHC of HENMT1 did not show stage- or cell-specific staining (Additional file 13: Figure S6E). Overall, our analyses showed that RMPs are dynamically expressed during spermatogenesis and during sperm maturation and that, for the four genes investigated, protein expression patterns were largely in agreement with mRNA expression.

Analysis of RMP expression in tumor-normal paired human samples reveals heterogeneity in RMP dysregulation across cancer types

Due to their ability to modulate RNA metabolism and influence protein synthesis rates, RNA modifications have recently emerged as important regulators of cancer [68,69,70]. Several studies have shown that modulation of the RNA modification machinery can decrease the expression of specific oncogenes [36, 71]. For example, in the case of glioblastoma, treatment with an FTO inhibitor was shown to decrease the expression levels of certain oncogenes [69]. Similarly, tRNA-modifying enzymes NSUN2 and METTL1 can affect chemotherapy sensitivity by changing the methylation states of certain tRNAs [72]. Thus, understanding which epitranscriptomic players are dysregulated in each tumor type is essential to guide the research for future anticancer therapies targeting this regulatory layer.

To this end, we performed an integrative analysis of RMP gene expression across 13,358 tumor-normal paired human samples gathered from publicly available datasets [73], which included 28 different cancer types (Additional file 5: Table S5). First, we compared the expression patterns between paired tumor-normal samples by measuring the log2 fold changes of median gene expression between tumor and normal paired samples, for each RMP and cancer type (Additional file 13: Figure S11 and “Methods”). We found that certain cancer types, such as pancreatic adenocarcinoma (PAAD) and acute myeloid leukemia (LAML), showed significant dysregulation of a large proportion of RMPs (Additional file 13: Figure S11). Surprised by this result, we wondered whether these global up-/downregulation patterns could in fact be an artifact generated by the use of external datasets. Indeed, certain TCGA cancer types do not have real “matched” tumor-normal data readily available and often employ data from other publicly available datasets (e.g., GTEx) as “normal” human tissue (Additional file 5: Table S5).
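
The first comparison described above reduces to one number per RMP and cancer type. A minimal sketch follows, assuming linear-scale expression values (such as TPM) as input and a pseudocount of 1 to avoid division by zero; the pseudocount, the function name, and the toy values are illustrative assumptions rather than details taken from “Methods”.

```python
import math
from statistics import median


def log2_fold_change(tumor_values, normal_values, pseudocount=1.0):
    """log2 ratio of median tumor expression to median normal expression.

    The pseudocount is an assumption added to keep the ratio defined when a
    gene is not detected in one of the two groups.
    """
    tumor_median = median(tumor_values) + pseudocount
    normal_median = median(normal_values) + pseudocount
    return math.log2(tumor_median / normal_median)


# Toy expression values for one gene in one cancer type.
tumor = [7.0, 15.0, 31.0]   # median 15
normal = [3.0, 3.0, 5.0]    # median 3
lfc = log2_fold_change(tumor, normal)
print(lfc)  # (15 + 1) / (3 + 1) = 4, so log2 gives 2.0
```

Because this statistic compares each gene only with itself, a systematic shift between the tumor and “normal” sources moves every gene in the same direction, which is the kind of global pattern the text goes on to question.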

To address this issue, we extracted the gene expression levels of all genes (not just RMPs) for each cancer type, finding that certain cancer types that employ GTEx data as source of “normal” human tissues, such as LAML, display low Pearson correlation values between matched tumor-normal samples (r² = 0.86), compared to those observed in other cancer types such as prostate adenocarcinoma (PRAD) (r² = 0.98) (Additional file 13: Figure S12). Thus, to identify which RMPs were significantly dysregulated in each cancer type, we computed “dysregulation scores” [48], which take into account the global variance of the tumor-normal paired data, for each cancer type (Fig. 4a). We considered an RMP as dysregulated in a given cancer type if its dysregulation score was higher than 2.5 standard deviations (SD) relative to the linear fit to the gene expression in the matched normal tissue (see “Methods”). Using this strategy, we identified a total of 40 RMPs which are dysregulated in at least one cancer type (Table 1, see also Additional file 6: Table S6). Moreover, we find that the “global” up-/downregulation patterns found using log2 fold change comparisons are no longer observed (Fig. 4b), suggesting that these results were in fact artifacts caused by the lack of proper “matched” normal tissues for certain cancer types.
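
One way to read the dysregulation score is as a standardized residual: fit a line to tumor versus normal expression across all genes, then express each gene's distance from that line in units of the residual SD. The sketch below implements that reading with ordinary least squares; it is an assumption about the method in [48], not a reproduction of it, and the gene names and values are synthetic.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def dysregulation_scores(normal, tumor):
    """Residual from the tumor-vs-normal fit, in units of residual SD.

    normal, tumor: {gene: median log expression} over the same genes.
    """
    genes = sorted(normal)
    x = [normal[g] for g in genes]
    y = [tumor[g] for g in genes]
    slope, intercept = linear_fit(x, y)
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return {g: r / sd for g, r in zip(genes, residuals)}


# Synthetic data: 20 genes on the diagonal, one shifted upward in tumor.
normal = {f"gene_{i:02d}": 0.5 * i for i in range(20)}
tumor = dict(normal)
tumor["gene_10"] += 4.0

scores = dysregulation_scores(normal, tumor)
# Threshold of 2.5 SD as stated in the text; abs() flags both directions.
dysregulated = [g for g, s in scores.items() if abs(s) > 2.5]
print(dysregulated)
```

Because the threshold is expressed relative to the spread of all genes in the same tumor-normal pair, a poorly matched normal reference widens the residual SD and raises the bar for calling any single gene dysregulated, which a plain fold-change cutoff does not do.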

Fig. 4

Expression analysis of RMPs in human tumor-normal paired samples. a Heatmap of z-scaled dysregulation scores of RMPs in tumor-normal paired samples, across 28 cancer types. Positive (red) values indicate upregulation in tumor samples, whereas negative (blue) values indicate downregulation. Genes labeled in red (upregulated) and blue (downregulated) represent top significantly dysregulated genes, which are also individually listed in panel c. b Scatter plot comparing RMP expression levels of matched tumor-normal samples, for the following cancer types: LAML (acute myeloid leukemia), UCS (uterine carcinosarcoma), BRCA (breast invasive carcinoma), and KIRP (kidney renal papillary cell carcinoma). Values represent median log(TPM) across all patients. Black data points indicate the expression of RMPs, where dysregulated genes are highlighted in red (upregulated) or blue (downregulated). Non-RMP genes are depicted in gray. c Barplot illustrating the number of cancer types in which each RMP is significantly dysregulated, shown in red (upregulated) or blue (downregulated). Only RMPs that are dysregulated in more than 2 cancer types are shown. For the full list of dysregulated RMPs, see Table 1. d Boxplots of log(TPM) mRNA expression values of HENMT1 (upper panel) and LAGE3 (bottom panel) across all 28 cancer types analyzed in this work. Green box plots represent normal samples, whereas red box plots represent tumor samples. Tumor-normal pairs highlighted in cyan represent cancer types in which the RMP is significantly downregulated, whereas those highlighted in orange represent those cancer types in which the RMP is upregulated. Error bars represent standard deviation of mRNA expression levels across patients. Each data point represents a different patient sample.
Abbreviations: ACC (adrenocortical carcinoma), BLCA (bladder urothelial carcinoma), BRCA (breast invasive carcinoma), CESC (cervical squamous cell carcinoma and endocervical adenocarcinoma), COAD (colon adenocarcinoma), ESCA (esophageal carcinoma), GBM (glioblastoma multiforme), HNSC (head and neck squamous cell carcinoma), KICH (kidney chromophobe), KIRC (kidney renal clear cell carcinoma), KIRP (kidney renal papillary cell carcinoma), LAML (acute myeloid leukemia), LGG (brain lower-grade glioma), LIHC (liver hepatocellular carcinoma), LUAD (lung adenocarcinoma), LUSC (lung squamous cell carcinoma), OV (ovarian serous cystadenocarcinoma), PAAD (pancreatic adenocarcinoma), PCPG (pheochromocytoma and paraganglioma), PRAD (prostate adenocarcinoma), READ (rectum adenocarcinoma), SARC (sarcoma), SKCM (skin cutaneous melanoma), STAD (stomach adenocarcinoma), TGCT (testicular germ cell tumors), THCA (thyroid carcinoma), UCEC (uterine corpus endometrial carcinoma), UCS (uterine carcinosarcoma)

Table 1 List of significantly dysregulated RMPs identified using dysregulation score-based analysis

Dysregulation score analyses of tumor-normal paired human samples identify LAGE3 and HENMT1 as top-ranked dysregulated RMPs

We then asked whether specific RMP genes were recurrently up- or downregulated in multiple cancer types, as these could constitute promising drug targets that could be used to treat diverse cancer types. We identified 11 RMPs that were upregulated in two or more cancer types, as well as 8 RMPs which were consistently downregulated in at least 2 cancer types (Fig. 4c, see also Table 1). We found that the most frequently upregulated RMP was HENMT1 (Fig. 4d), a piRNA 2′-O-methyltransferase which is highly expressed in gonadal cells, involved in transposable element (TE) mutagenesis protection [5, 74, 75]. Whether the global upregulation of HENMT1 in cancer samples might be contributing to increased TE mutagenesis is currently unknown.

The second most frequently upregulated RMP was the L antigen family member 3 (LAGE3), a component of the complex responsible for the formation of N6-threonylcarbamoyladenosine (t6A) in position 37 of tRNAs (Fig. 4d). Interestingly, this modification is found in the anticodon stem-loop of many tRNAs decoding ANN codons [76] and has been shown to affect both translation accuracy and efficiency [77]. Upregulation of HENMT1 and LAGE3 expression levels was consistently observed in tumors from distinct stages, with the highest expression in stages III and IV (Additional file 13: Figure S13).

We then examined whether LAGE3 and HENMT1 would be upregulated in patient-derived samples at the protein level. To this end, we employed tissue microarrays (TMAs) in combination with immunohistochemistry, analyzing a total of 72 samples (cores) from both tumor and normal tissues, for 12 different cancer types. Our results show that both LAGE3 and HENMT1 are upregulated in specific tumor types at the protein level (Fig. 5a, b), although the observed differences were not found to be statistically significant (Additional file 13: Figure S14). Nevertheless, our results suggest that LAGE3 and HENMT1 have altered expression levels in specific cancer types also at the protein level.

Fig. 5

Immunohistochemical analysis and prognostic value of RMP expression levels in different cancer types. a, b Immunohistochemical analysis and images of normal and tumor LAGE3-stained LUSC (lung squamous cell carcinoma), LIHC (liver hepatocellular carcinoma), and PRAD (prostate adenocarcinoma) (a) and HENMT1-stained HGSC (high-grade serous carcinoma), LUSC, and STAD (stomach adenocarcinoma) TMAs (b). Representative cores and subsets are shown for each tissue and antibody, where the brown color indicates a specific staining of the antibody and blue represents the hematoxylin counterstain. Mean TMA score is plotted for each core, with three cores from different individuals per condition quantified. Two-sided Wilcoxon tests did not yield significant differences in any comparison; p values of all tumor-normal comparisons for each cancer type and antibody are shown in Figure S13. c Heatmap of survival p values of 146 RMPs across 28 cancer types. Survival p values are calculated by comparing the prognosis of patients that express high (upper 50%) versus low (lower 50%) RMP levels. “N” column shows the number of patients included for the analysis of each cancer type. d Individual examples of survival plots where the expression levels of the RMP are predictive of cancer prognosis. p values have been calculated by comparing the survival between patients expressing high levels (yellow, top 50%) versus low expression levels (blue, bottom 50%)

Finally, we asked whether the expression levels of RMPs might be correlated with cancer prognosis. We identified 283 cases where RMP expression patterns are significantly associated with patient survival (Fig. 5c, see also Additional file 7: Table S7). For example, we found that high NSUN5 expression levels in glioblastoma (GBM) are correlated with poor prognosis, in agreement with a recent study [78]. Similarly, our work revealed BUD23 expression to be correlated with cancer survival, in agreement with another recent study [79].
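
The survival comparison can be sketched as a median split followed by a two-group test. The caption of Fig. 5 specifies the split (upper 50% versus lower 50% of expression); the choice of a log-rank test below is an assumption, since the test used is not named in this section, and all patient values are invented for illustration.

```python
import math
from statistics import median


def split_by_median(expression):
    """Indices of patients above the median (high) and at or below it (low)."""
    cut = median(expression)
    high = [i for i, v in enumerate(expression) if v > cut]
    low = [i for i, v in enumerate(expression) if v <= cut]
    return high, low


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank p value; events are 1 (death) or 0 (censored)."""
    data = ([(t, e, 0) for t, e in zip(times_a, events_a)]
            + [(t, e, 1) for t, e in zip(times_b, events_b)])
    observed_a = expected_a = variance = 0.0
    for t in sorted({u for u, e, _ in data if e}):
        at_risk_a = sum(1 for u, _, g in data if u >= t and g == 0)
        at_risk_b = sum(1 for u, _, g in data if u >= t and g == 1)
        deaths_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        deaths = sum(1 for u, e, _ in data if u == t and e)
        n = at_risk_a + at_risk_b
        observed_a += deaths_a
        expected_a += deaths * at_risk_a / n
        if n > 1:  # hypergeometric variance is undefined for n == 1
            variance += (deaths * at_risk_a * at_risk_b * (n - deaths)
                         / (n * n * (n - 1)))
    if variance == 0:
        return 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Invented cohort: high expressers die early, low expressers late.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
times = [50, 60, 70, 80, 5, 10, 15, 20]
events = [1] * 8  # every death observed, no censoring

high, low = split_by_median(expression)
p = logrank_p([times[i] for i in high], [events[i] for i in high],
              [times[i] for i in low], [events[i] for i in low])
print(high, low, p)
```

Repeating such a test for every RMP and cancer type produces a matrix of p values like the one summarized in Fig. 5c; with 146 RMPs and 28 cancer types, that is several thousand tests, so the individual p values are best read with multiple testing in mind.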

Surprisingly, we found that FTO expression levels are not significantly correlated with patient survival in LAML (Additional file 7: Table S7), despite this cancer type being used to test FTO inhibitors [80]. By contrast, LAGE3 expression levels were significantly correlated with patient survival in LAML (Fig. 5d). Among all the RMP-cancer pairs studied, we identified NSUN7 as the top-ranked RMP in terms of prediction of lower-grade glioma (LGG) patient survival (p = 8e−24), although its biological role still remains uncharacterized. Future research will be needed to functionally dissect the role that NSUN7 plays in glioma, as well as to decipher why its expression levels are highly predictive of patient survival.

Discussion

Over the past decade, systematic efforts to detect and map RNA modifications have boosted the new field of epitranscriptomic research. Many proteins are involved in the writing, reading, and erasing of RNA modifications, but their roles in tumorigenesis and potential as therapeutic targets remain largely uncharacterized. To bridge this gap, here we have compiled a list of 146 human RNA modification-related proteins (RMPs) (Additional file 1: Table S1) and have analyzed the evolutionary history and gene expression patterns of 90 RMPs across 32 mammalian tissues, 10 species, 5 cell types, and 13,358 tumor-normal paired cancer samples.

Through this analysis, we identify a large number of duplication events in multiple RNA modification families (Fig. 1) and find that duplications are often accompanied by the acquisition of restricted tissue expression patterns and/or changes in RNA target specificity. Therefore, RMP gene duplication is a strategy to acquire novel functions and is typically achieved by altering the expression patterns and/or RNA target selectivity of the paralog proteins, in agreement with other works studying gene evolution [81]. We find that the majority of tissue-restricted RMPs are in fact testis-enriched (Fig. 2), suggesting that certain RMPs might play a pivotal role in sperm formation and maturation. Moreover, deletion of testis-enriched genes such as NSUN7, ADAD1, or HENMT1 leads to male sterility [5, 56, 57, 63].

At the beginning of spermatid elongation, nuclear condensation starts, and consequently, the transcriptional machinery is shut down. Therefore, to provide proteins for the subsequent maturation steps of sperm assembly, mRNAs have to be premade in spermatocytes and round spermatids, before nuclear condensation happens, and translationally repressed until needed [50,51,52,53]. Chemical RNA modifications provide an ideal platform to achieve the fine regulation that is required upon transcriptional shutdown, determining which RNAs are expressed, repressed, or undergo decay [82]. In this regard, previous work has shown that METTL3/METTL14-mediated m6A modification is dynamically regulated in spermatogenesis [34]. Similarly, piRNA molecules in germline cells are tightly regulated by HENMT1, via 2′-O-methylation of their 3′ ends [5]. Here we show that a vast proportion of RMPs are dynamically regulated during spermatogenesis as well as during sperm maturation in the epididymis and, as such, may be involved in the regulation and decay of specific transcripts that occur during sperm formation and maturation (Fig. 3).

Recent work has shown that specific RNA modifications are essential for the intergenerational transmission of paternal diet-induced phenotypes [54]. Here, we identify two RMPs (TRDMT1 and METTL1) whose expression is significantly enriched in epididymis (Additional file 13: Figure S3), one of which (TRDMT1) was recently shown to be involved in the transmission of diet-induced paternal phenotypes across generations [54]. Whether METTL1 plays a role in intergenerational inheritance is yet to be deciphered; however, recent insights showing its role in miRNA maturation [61] suggest that this enzyme might be playing a role in miRNA-acquired inheritance of information.

In the last few years, several studies have placed RNA modifications at the forefront of cancer research [36, 38, 68, 70, 83], mostly focused on the machinery responsible for writing and erasing m6A modifications. For many years, FTO was thought to be of special interest due to its association with obesity [84]. However, later studies proved this genome-wide association to be false and showed that the single nucleotide variant present in the FTO intron was in fact associated with the activity of neighboring genes [85].

Nonetheless, FTO kept receiving special attention due to its perceived activity as an eraser of N6-methyladenosine (m6A) [27], the most frequent type of RNA modification present in mRNAs. However, this is now thought to be incorrect, as later studies showed that FTO is in fact an eraser of N6,2′-O-dimethyladenosine (m6Am), which is much less abundant in mRNAs [28, 86]. Similarly, FTO has been proposed to constitute a promising target for antitumor therapies [38, 80, 87]. While FTO has been shown to play an important role in leukemia [87], it is possible that additional RMPs such as HENMT1, which is drastically dysregulated in this cancer type, might constitute a better drug target to inhibit leukemogenesis (Fig. 4).

Here, we show that the expression of 40 RMPs is significantly altered in tumor samples, relative to their matched normal samples (Table 1 and Fig. 4). Moreover, we identify two enzymes, LAGE3 and HENMT1, as the top recurrently upregulated RMPs across cancer types. Surprisingly, these proteins have so far received little attention in cancer research studies. LAGE3 mutations are known to cause multiple human diseases, including nephrotic syndrome and microcephaly [88]; however, its role in tumorigenesis and cancer progression is yet to be determined. Here, we attempted to validate the upregulation of LAGE3 and HENMT1 across a battery of 12 cancer types using tissue microarrays (TMAs) (Fig. 5). While we were able to identify several cancer types where LAGE3 and HENMT1 were consistently upregulated, the variability among cancer grades across the tumor cores, together with the low number of cores per tumor type (n = 3), led to insufficient statistical power to identify significant expression changes. Future work will be needed to decipher the biological role of LAGE3 and HENMT1 in cancer, as well as their potential use as targets for diagnostic and prognostic purposes.

Conclusions

Our analyses reveal an unanticipated heterogeneity in the expression patterns of RMPs across healthy mammalian tissues, with an over-representation of testis-specific RMPs, many of which are essential for sperm formation and maturation and, in some cases, are required for the transmission of epigenetic information across generations. In addition, we uncover a large proportion of dysregulated RMPs in multiple cancer types and show that several RMPs are dysregulated to a much larger extent than the commonly studied m6A modification pathway, stressing the need to extend epitranscriptomic drug targeting strategies to additional RNA modification enzymes. Now that novel transcriptome-wide tools to map additional RNA modifications have recently been made available, the community can redirect antitumoral strategies to those RNA modification pathways that are most significantly dysregulated in each cancer type.

Methods

Compilation of human RNA modification-related proteins (RMPs)

An initial list of human methyltransferases, deaminases, and pseudouridylases was obtained by merging the lists available in the MODOMICS database (http://modomics.genesilico.pl/) and in a recently published review [13]. These lists were further completed with candidate genes by adding annotated proteins from UniProt [42]. For each of these proteins, hidden Markov model (HMM) profiles of the corresponding PFAM catalytic domains were retrieved (Additional file 8: Table S8) by querying the PFAM database (https://pfam.xfam.org/). Each HMM profile was then used to query the human proteome using the hmmsearch function from HMMER software v.3.2.1 (http://hmmer.org/). Proteins scoring above the default threshold were kept as candidate RMW proteins (Additional file 9: Table S9). Related information for each of these proteins (modification type, target RNA, localization) was extracted from UniProt, as well as from relevant literature [42]. Additional tRNA writer proteins were gathered from a recent study matching tRNA modifications to their writers [89]. Readers, erasers, and non-catalytic subunit proteins were obtained from annotated UniProt genes as well as from published literature [90]. APOBEC3G and APOBEC3A were included in the analyses due to recent literature showing their deamination activity on RNA molecules in vivo in addition to acting on DNA [91, 92].

Phylogenetic analysis

We first built a set of representative eukaryotic species by choosing one species for each major phylogenetic clade for which complete proteomes were available. Our final list of representative species consisted of 25 complete proteomes from UniProt [42], which included 23 eukaryotic species, as well as 2 outgroups (1 bacterium and 1 archaeon) (Additional file 10: Table S10). For each proteome and RMW, we performed HMM-based searches, as described above. Candidate orthologs were manually curated to ensure that we did not miss any ortholog in our analysis, which resulted in a final table of RMW ortholog proteins (Additional file 11: Table S11). For each curated ortholog dataset, multiple sequence alignments were built using MAFFT with the G-INS-1 method [93]. Alignment files were used to construct maximum likelihood phylogenetic trees using IQ-Tree with bootstrapping (n = 5000) [94]. Consensus trees were visualized using FigTree v1.4.4 [95] and used to identify the duplication events (Additional file 2: Table S2).

Tissue specificity analysis

Human mRNA expression levels (TPM, transcripts per million) for each of the 146 human RMPs were downloaded from the Genotype-Tissue Expression (GTEx) dataset [45], version v7, as well as from the Human Protein Atlas (HPA) [96]. Three GTEx tissues (whole blood, transformed lymphocytes, and transformed fibroblasts) were discarded from downstream analyses, as these have been previously considered as outliers that can bias the analyses [45] or are not normal tissues of the human body. mRNA expression levels for adult mouse tissues (TPM) were obtained from a study that is part of the ENCODE project [97]. For each dataset (HPA, GTEx, ENCODE), we log-transformed the TPM values after the addition of a pseudocount. To determine which genes were tissue-specific, we compared the expression levels of RMPs in a given tissue to the median expression levels of RMPs across all tissues. We then calculated the residuals of each RMP to the regression line of each tissue (using the rlm function), which we refer to as the “tissue specificity score” (TS). An RMP was considered tissue-specific if its TS was greater than 2.5 standard deviations (SD), as previously described [48], which, in a normal distribution of the standardized residuals, corresponds to the region outside the central 98.8% of the distribution.
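The residual-based scoring described above can be sketched in a few lines. The following is a minimal illustration in Python rather than the R code used in the study: `huber_line` is a simplified stand-in for the robust fit performed by rlm (Huber weights with a MAD scale estimate), and dividing the residuals by their sample SD is an assumption about how they were standardized. All gene names and values are hypothetical.

```python
from statistics import median, stdev


def huber_line(x, y, k=1.345, iters=50):
    """Fit y = a + b*x by iteratively reweighted least squares (Huber weights).

    Simplified stand-in for a robust linear fit; not the authors' code.
    """
    w = [1.0] * len(x)
    a = b = 0.0
    for _ in range(iters):
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        b = sxy / sxx
        a = my - b * mx
        res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate: median absolute residual / 0.6745 (MAD)
        scale = median(abs(r) for r in res) / 0.6745 or 1e-12
        # Huber weights: full weight near the line, down-weighted far from it
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in res]
    return a, b


def tissue_specificity(tissue_log_expr, median_log_expr, threshold=2.5):
    """Return per-gene scores (residual / SD) and the genes above the threshold."""
    genes = sorted(tissue_log_expr)
    x = [median_log_expr[g] for g in genes]
    y = [tissue_log_expr[g] for g in genes]
    a, b = huber_line(x, y)
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = stdev(res)  # assumption: residuals standardized by their sample SD
    scores = {g: r / sd for g, r in zip(genes, res)}
    specific = {g for g, s in scores.items() if s > threshold}
    return scores, specific


# Hypothetical log-expression values: 21 genes, one strongly enriched in the tissue
median_expr = {f"GENE{i}": 0.5 * i for i in range(21)}
tissue_expr = {g: v + 0.05 * (-1) ** i for i, (g, v) in enumerate(median_expr.items())}
tissue_expr["GENE10"] = median_expr["GENE10"] + 4.0

scores, specific = tissue_specificity(tissue_expr, median_expr)
print(sorted(specific))
```

Because the fit down-weights points far from the line, a single strongly enriched gene barely shifts the regression, so its own residual stays large and it is the only gene flagged.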

RNA extraction from mouse tissues and quantitative real-time PCR

Brain, liver, lung, and testis tissues were collected from 20-week-old C57BL/6J mice in triplicate. RNA was extracted from tissues using TRIzol™ Reagent (15596018, Thermo Fisher Scientific) and Chloroform (C2432, Vidra Foc) as per the manufacturer’s instructions, and precipitated with isopropanol (BP2618-500, Thermo Fisher Scientific) and Pellet Paint® Co-Precipitant (69049, Novagen). Samples were DNase treated with Turbo™ DNase (AM2238, Thermo Fisher Scientific) for 15 min at 37 °C and cleaned up using Agencourt RNAClean XP beads (A63987, Beckman Coulter) as per the manufacturer’s instructions. Quality of the extracted RNA was assessed using a Nanodrop™ Spectrophotometer 2000. cDNA was synthesized using Superscript II™ (18064014, Thermo Fisher Scientific) following the manufacturer’s instructions. Quantitative Real-Time PCR (qRT-PCR) was performed with Power SYBR™ Green PCR Mix (4367659, Thermo Fisher Scientific) using ViiA™ 7 Real-Time PCR System as per the manufacturer’s instructions. For each primer pair, three biological replicates with three technical replicate reactions were performed (total of 9 reactions per primer pair). METTL5, which is expressed stably among the four mouse tissues studied [98], was used for normalization purposes. Results were also analyzed using GAPDH for normalization purposes. qRT-PCR plots were built using GraphPad Prism 8. All oligonucleotides used for qRT-PCR can be found listed in Additional file 12: Table S12.

RMP expression analysis across tissues in amniote species

mRNA expression levels of 10 amniote species (human, chimpanzee, bonobo, gorilla, orangutan, rhesus macaque, mouse, gray short-tailed opossum, platypus, and chicken) were obtained from GSE30352 [99]. Normalized RPKM values of constitutive exons for both amniote and primate orthologs were used for downstream analyses. Heatmaps of the log-transformed (with a pseudocount) and row (gene) z-scaled tissue-wide mRNA expression values were built using the ComplexHeatmap R package. PCA was performed using the prcomp function in R, and scores (amniote and primate tissues) and loadings (orthologous genes) were plotted for the first two principal components using the ggplot2 R package.
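As an illustration of the heatmap preprocessing, the sketch below log-transforms expression values with a pseudocount and then z-scales each gene (row) across tissues. It is written in Python for brevity and is not the authors' code; the pseudocount of 1, the use of log2, and the gene names are assumptions, since the text does not specify them.

```python
import math
from statistics import mean, stdev


def log_zscale_rows(matrix, pseudocount=1.0):
    """Log2-transform each row and scale it to mean 0 and sample SD 1.

    `matrix` maps a gene name to its expression values across tissues.
    Rows with no variation are returned as zeros to avoid division by zero.
    """
    scaled = {}
    for gene, values in matrix.items():
        logged = [math.log2(v + pseudocount) for v in values]
        mu = mean(logged)
        sd = stdev(logged)
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in logged]
    return scaled


# Hypothetical RPKM values for two genes across three tissues
example = {"GENE_A": [0.0, 1.0, 3.0], "GENE_B": [5.0, 5.0, 5.0]}
z = log_zscale_rows(example)
print(z["GENE_A"])
```

Row-wise scaling puts genes with very different absolute expression levels on a common scale, so the heatmap shows relative enrichment per tissue rather than overall abundance.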

Analysis of RMP expression during spermatogenesis

Processed spermatogenesis data was extracted from GSE112393 [55]. Input data was used to perform k-means clustering of RMPs based on their expression profiles in different sperm cell populations. The optimal number of clusters was determined by plotting the within-group sum of squares against the number of clusters extracted using the kmeans function in R, following the criteria of the scree test. Heatmaps were built using the ComplexHeatmap R package. Violin plots were built using the ggplot2 R package. To assess the consistency of our results across diverse datasets, we analyzed the RMP expression patterns from two additional mouse spermatogenesis studies [58, 59]. For the first dataset, we used the same gene cluster groups and plotted the corresponding heatmap and violin plots using the ggplot2 R package (Additional file 13: Figure S7). For the second dataset, we obtained the graphical representations for individual RMPs (Additional file 13: Figure S8) from the interactive website accompanying the paper [59].
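The scree-style choice of cluster number can be illustrated by computing the within-group sum of squares (WSS) for increasing k and looking for the point where the curve flattens. The sketch below is a simplified, deterministic Lloyd-type k-means in Python; it is not the R kmeans implementation used in the study, and the farthest-point initialization and toy expression profiles are assumptions made to keep the example reproducible.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wss(points, k, iters=100):
    """Run a basic k-means and return the within-group sum of squares."""
    # Deterministic farthest-point initialization (assumption, for reproducibility)
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sqdist(p, c) for c in centroids))
        )
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sqdist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute centroids; keep the old one if a cluster ends up empty
        updated = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sqdist(p, c) for c in centroids) for p in points)


# Hypothetical two-dimensional expression profiles forming two clear groups
profiles = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
wss_by_k = {k: kmeans_wss(profiles, k) for k in (1, 2, 3)}
print(wss_by_k)
```

In this toy example the WSS drops sharply from k = 1 to k = 2 and changes little afterwards, which is the "elbow" that the scree criterion looks for.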

Immunohistochemistry

Testis and epididymis from 6- to 12-week-old C57BL/6J mice were fixed overnight at 4 °C with neutral buffered formalin (HT501128-4L, Sigma-Aldrich) and embedded in paraffin. Paraffin-embedded tissue sections (3 μm in thickness) were air dried and further dried at 60 °C overnight. Immunohistochemistry was performed using the Discovery XT Ventana Platform (Roche). Antigen retrieval was performed with Discovery CC1 buffer (950-500, Roche). Primary antibodies rabbit polyclonal anti-NSUN2 (20854-1-AP, Proteintech), rabbit polyclonal anti-NSUN7 (PA5-54257, Thermo Fisher Scientific), rabbit polyclonal anti-HENMT1 (PA5-55866, Thermo Fisher Scientific), and rabbit polyclonal anti-METTL14 (HPA038002) were diluted 1:1000, 1:100, 1:150, and 1:2000, respectively, with EnVision FLEX Antibody Diluent (K800621, Dako, Agilent) and incubated for 60 min. Secondary antibody OmniMap anti-rabbit HRP (760-4311) was incubated for 20 min. Detection of the labeling was performed using the ChromoMAP DAB (760-159, Roche). Sections were counterstained with hematoxylin (760-2021, Roche) and mounted with Dako Toluene-Free Mounting Medium (CS705, Agilent) using a Dako CoverStainer (Agilent). Specificity of staining was confirmed with a rabbit IgG, polyclonal Isotype Control (ab27478, Abcam). Brightfield images were acquired with a NanoZoomer-2.0 HT C9600 digital scanner (Hamamatsu) equipped with a × 20 objective. All images were visualized with a gamma correction set at 1.8 in the image control panel of the NDP.view 2 U123888-01 software (Hamamatsu Photonics, France). Mouse samples were collected, prepared as paraffin blocks, sliced, and stained at the IRB Histopathology Facility. Negative controls for each antibody were also included, which showed no staining (Additional file 13: Figure S15). All IHC experiments were performed in biological triplicates.

Immunofluorescence

Testis and epididymis from 12-week-old C57BL/6J mice were embedded in Tissue-Tek® O.C.T™ Compound (4583, Sakura) and 12-μm sagittal sections were mounted on SuperFrost™ microscope slides (12372098, Thermo Fisher Scientific). Tissue sections were defrosted, circled with a PAP pen (Z377821, Sigma-Aldrich), fixed in 4% PFA (28908, Thermo Fisher Scientific) for 10 min, and permeabilized in 0.5% Triton-X 100 (T8787, Sigma-Aldrich) for 30 min. Subsequently, sections were blocked in 5% BSA (A7906, Sigma-Aldrich) for 45 min at room temperature and incubated in primary antibody in 5% BSA overnight at 4 °C. Primary antibodies were used at the following dilutions: 1:40 rabbit polyclonal anti-NSUN2 (20854-1-AP, Proteintech), 1:20 rabbit polyclonal anti-NSUN7 (PA5-54257, Thermo Fisher Scientific), 1:50 mouse monoclonal anti-DDX4 (ab27591, Abcam), 1:250 mouse monoclonal anti-Fibrillarin (38F3, Novus Biologicals), and 2 μg/mL IgG Isotype controls (G3A1 and 2791, Cell Signaling). Sections were incubated with 1:400 Alexa488-coupled anti-mouse (A-11001, Thermo Fisher Scientific) and Alexa555-coupled anti-rabbit (A-21429, Thermo Fisher Scientific) secondaries and counterstained with 1:10,000 Hoechst 33342 (H3570, Life Technologies) for 2 h at room temperature, then mounted with Fluoromount™ Aqueous Mounting Medium (F4680, Sigma-Aldrich). Prepared slides were imaged on a Leica TCS SPE using a 63X NA1.4 oil objective. Three 1024 × 1024 representative regions of interest were imaged per testis (n = 3) over a 3D stack (3–5 μm depth with a z-step size of 1 μm), using a zoom factor of 2. All images were captured with a frame average of 4, with the exception of Hoechst, which was imaged with a frame average of 2.

Analysis of RMP expression in tumor-normal paired human datasets

TPM expression values were downloaded from the UCSC XENA Project, which contains TCGA and GTEx RNA-seq data that have been processed together to provide more reliable expression analysis between tumor and normal samples [73]. We discarded the CHOL, THYM, and DLBC tumor-normal tissue pairs because these cancer types lacked a proper normal tissue control (low number of patients). Data was transformed into log2(TPM + 1) for downstream analyses. For the log2(FC) analyses, we calculated the difference between the median log2 expression levels of the tumor and normal datasets, for each cancer type and RMP. For the dysregulation analysis, we calculated the residuals (using the rlm function in R) for all genes in a given tumor-normal tissue pair, which have been previously termed “dysregulation scores” (DS) [48]. We set the DS significance threshold at 2.5 standard deviations (SD), as previously described [48], which, in a normal distribution of standardized residuals, corresponds to the region outside the central 98.8% of the distribution. We then extracted the dysregulation scores of the RMPs and used them for further downstream analyses. For heatmap representations, dysregulation scores were scaled and centered, and the final heatmap was built using the ComplexHeatmap R package. Scatter plots of median log2 expression values for all genes in tumor-normal paired data were built using the ggplot2 R package, with RMPs highlighted in black, significantly dysregulated RMPs in red (upregulated in tumor) or blue (downregulated in tumor), and non-RMP proteins depicted in gray.
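The log2(FC) calculation described above reduces to a difference of medians on the log2(TPM + 1) scale. A minimal Python sketch follows (not the authors' R code); the function name and the TPM values are hypothetical.

```python
import math
from statistics import median


def log2_fold_change(tumor_tpm, normal_tpm):
    """Difference between median log2(TPM + 1) in tumor and normal samples."""
    tumor_median = median(math.log2(v + 1) for v in tumor_tpm)
    normal_median = median(math.log2(v + 1) for v in normal_tpm)
    return tumor_median - normal_median


# Hypothetical TPM values for one RMP in one cancer type
tumor = [7.0, 15.0, 31.0]   # log2(TPM + 1) = 3, 4, 5
normal = [1.0, 3.0, 7.0]    # log2(TPM + 1) = 1, 2, 3
fc = log2_fold_change(tumor, normal)
print(fc)
```

A positive value indicates higher median expression in tumor samples; using medians rather than means limits the influence of individual outlier samples.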

Survival analyses

Survival phenotypes were downloaded from the XENA Platform, using the “TCGA TARGET GTEX” cohort [73]. To analyze the survival data, we first identified patients with “high” (upper 50% relative to average expression) and “low” (lower 50% relative to average expression) expression of a specific gene, and matched these patients with their overall survival information. We then used the survminer R package to plot survival curves for each gene and every cancer type, as well as to extract the survival p values. p values were transformed by inversion and subsequent log-transformation with a pseudocount [log(1/p + 1)]. A heatmap of the survival p values was built using the ComplexHeatmap R package. Transformed survival p values were visualized using ggplot2.
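The patient grouping and p value transformation can be sketched as follows. This Python example is illustrative only: splitting at the median is an assumption about how the upper and lower 50% were defined, the natural logarithm is assumed because the base is not stated, and the patient identifiers and values are hypothetical. The survival test itself (performed with survminer in the study) is not reproduced here.

```python
import math
from statistics import median


def split_high_low(expression_by_patient):
    """Split patients into high and low expression groups at the median."""
    cutoff = median(expression_by_patient.values())
    high = {p for p, v in expression_by_patient.items() if v > cutoff}
    low = set(expression_by_patient) - high
    return high, low


def transform_p(p):
    """Transform a p value as log(1/p + 1); smaller p gives a larger score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log(1.0 / p + 1.0)  # natural log is an assumption


# Hypothetical expression values for four patients
expr = {"P1": 1.0, "P2": 2.0, "P3": 3.0, "P4": 4.0}
high, low = split_high_low(expr)
print(sorted(high), sorted(low))
print(transform_p(0.05), transform_p(8e-24))
```

The pseudocount keeps the transformed value non-negative even for p = 1, so non-significant gene-cancer pairs sit near the bottom of the color scale instead of at zero or below.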

Tumor microarray immunohistochemistry and analysis

Multiorgan tumor with adjacent normal tissue microarray slides, with accompanying pathology grade, TNM (tumor, node, and metastasis) classification, and clinical stage information, were purchased from US Biomax Inc. (BCN721a). Each slide contains three malignant and three normal cores from 12 types of human organs (esophagus, stomach, colon, rectum, liver, lung, kidney, breast, cervix, ovary, prostate, and pancreas), with each core taken from a different individual. TMAs were stained at the IRB Histopathology Facility. Primary antibodies rabbit polyclonal anti-HENMT1 (PA5-55866, Thermo Fisher Scientific) and rabbit polyclonal anti-LAGE3 (HPA036123, Sigma-Aldrich) were diluted to 1:50. Secondary antibody OmniMap anti-rabbit HRP (760-4311) was incubated for 20 min. Detection of the labeling was performed using the ChromoMAP DAB (760-159, Roche). For scoring of tissue microarrays, each core was given a score from 0 to 4 based on the proportion of positively stained cells: 0 represents < 2% of cells staining positive, 1 represents 2–25%, 2 represents 26–50%, 3 represents 51–75%, and 4 represents 76–100% staining of cells [100]. Two independent scorers, both blinded to the antibody and tissue type, scored the stainings following the guidelines described above. The scores from the two scorers were averaged to obtain the final score per core. A two-sided Wilcoxon test was used to assess significance.
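The scoring scheme above maps the percentage of positively stained cells to a 0–4 score and averages the two scorers. A minimal Python sketch is given below; the function names are hypothetical, and the handling of non-integer percentages that fall between the stated bins (for example 25.5%) is an assumption. The Wilcoxon test is not included.

```python
def proportion_score(percent_positive):
    """Map the percentage of positively stained cells to a 0-4 score."""
    if not 0 <= percent_positive <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percent_positive < 2:
        return 0
    if percent_positive <= 25:
        return 1
    # Assumption: values between the integer bins go to the higher bin
    if percent_positive <= 50:
        return 2
    if percent_positive <= 75:
        return 3
    return 4


def final_core_score(score_a, score_b):
    """Average the scores assigned by the two independent scorers."""
    return (score_a + score_b) / 2


# Hypothetical core: scorer A estimates 40% positive cells, scorer B 55%
core = final_core_score(proportion_score(40), proportion_score(55))
print(core)
```

Averaging can produce half-point scores (2.5 in this example), which is why the final per-core values are not restricted to the five integer categories.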