Background

There are over 100 RNA covalent modifications (RCMs) that are deposited on all classes of RNAs at various stages of their lifecycle [1]. RCMs, collectively referred to as the “epitranscriptome” [2, 3] are known to influence RNA stability, splicing, structure, intra- and intermolecular interactions, and translation [4,5,6,7,8,9]. While most RCMs have been observed on ribosomal and transfer RNAs, they have also been observed on messenger RNAs (mRNAs) and long non-coding RNAs (lncRNAs), albeit at lower levels [4, 8, 10]. Although pseudouridine is the most abundant modification among all classes of RNA [10], N6-methyladenosine (m6A) is the most abundant internal mRNA RCM targeting approximately 30% of all mRNAs and is present at ∼ 0.4% of all adenosine nucleotides in cellular RNAs [11, 12]. Additional internal RCMs, such as pseudouridine, N1-methyladenosine (m1A), and 5-methylcytidine (m5C), have been observed on mRNAs in a number of eukaryotic systems, although their functional significance on these RNA molecules is not always clear [13,14,15].

Much of what we do know about the epitranscriptome is drawn from studies on m6A in mammalian and plant systems, but can likely be applied to many of the other modification classes. With the notable exception of pseudouridine, which results from the isomerization of uridine [16,17,18,19], RCMs are deposited through the enzymatic activity of highly conserved proteins called writers (e.g., methyltransferases), “interpreted”, or bound, by RNA binding proteins called readers, and removed by eraser enzymes (e.g., demethylases) [5, 20,21,22]. Arabidopsis and mammals have a similar m6A frequency across the transcriptome (∼ 1 m6A per 2000 nucleotides), which is targeted to a similar consensus motif RRACH (R = G or A; H = U, A, or C), and has a similar 3’ mRNA bias around the stop codon [11, 23]. In plants and mammals, pseudouridine has been found to localize primarily to the CDS and 5’ UTR of mRNAs, is deposited on a wide array of mRNAs, and prefers the first U of a triplet codon (e.g., UUC/UCU/UUU) [17]. Motif preferences for m5C can vary depending on the methyltransferase involved, but in humans are generally found on C/G rich regions of mRNAs [24]. While less is known about binding preferences of the other predominant RCMs, such as m1A and m3C, aberrant addition (or removal) has been observed as a hallmark for tumorigenesis [25, 26]. Importantly, loss of many of the writers, readers, and erasers for these RCMs is lethal or causes severe developmental defects in all tested systems [14, 27,28,29,30].

RCMs function through a variety of molecular mechanisms to regulate mRNA stability, structure, and splicing. RCMs induce changes in base-pairing properties and higher order structure, ostensibly allowing for increased flexibility in interactors, translational capacity, and structure-associated stabilization [31]. There is a close association between splicing and m6A deposition [23, 32], with recent studies revealing that m6A is specifically excluded from splice junctions due to physical occlusion by the exon junction complex, a process that ultimately impacts cytoplasmic mRNA decay [33, 34]. Pseudouridine has also been implicated in mRNA splicing, stability, and translational efficiency [17, 35, 36], as well as the stability and maturation of lncRNAs such as the telomerase RNA [37, 38]. Other modifications, such as m1A, m3C, and m5C are also believed to be important for mRNA stability, mostly due to phenotypes associated with loss of function mutants in their associated writer proteins [25, 35, 39, 40]. Thus, RCMs are both a ubiquitous and critical aspect of an RNA’s lifecycle, but specific mechanistic details for many of them are still lacking.

Given the high frequency with which essential RNAs such as rRNAs, tRNAs and snoRNAs are modified, it has historically been difficult to use genetic approaches to monitor the RCM status of mRNAs and lncRNAs. As a result, most studies have relied either on transcriptome-wide antibody-based (m6A), or modification-specific chemical-genomic approaches (e.g., bisulfite sequencing for m5C; [7, 10, 41, 42]). An alternative sequencing-based approach relies on the propensity for RCMs that occur at the interface between the canonical base-pairing edge (i.e., the Watson-Crick-Franklin, or WCF base-pairing edge) to be misidentified by reverse transcriptases during the cDNA synthesis step [43,44,45,46,47]. This misidentification results in non-random misincorporations or deletions at modified residues. Multiple algorithms, including HAMR and ModTect [15, 44], have been developed to infer modification status, and class, based on these “sequencing errors”. While this approach cannot detect RCMs outside of the WCF base-pairing edge, over 30 modification types can be identified, including m1A, pseudouridine, m3C, and m5C [15, 44, 48, 49]. Importantly, where there is overlap (such as for m5C, m3C, and m1A), the antibody, chemical-genomic, and sequencing error-based approaches are largely in agreement [14, 15]. Inferring modification status via HAMR or ModTect allows both for the repurposing of existing RNA-seq data, of which there are petabases available in NCBI’s SRA [50], and the side-by-side calculation of modification and transcript abundance.

Here we utilized both HAMR and ModTect to analyze new and publicly available RNA-seq in Arabidopsis and three agronomically important and closely related grasses: Sorghum bicolor, Zea mays, and Setaria italica. We used these diverse stress, tissue-atlas, and developmental datasets in an attempt to better understand what some of these less characterized RCMs might be doing in plants and how they might be conserved, both functionally and in terms of their target genes. We present an in-depth comparative assessment of RCM dynamics across these plant species and uncover new insights into RNA splicing, RNA stability, and plant responses to stress. We demonstrate a level of conservation of target genes previously unseen for RCMs, and reveal that RCMs may participate in population-level variation in stress responses. Mechanistically, we identify a link between splicing and RCMs that is present in all examined systems. Finally, through the analysis of RNA decay pathways in Arabidopsis, we establish that RCMs are associated with unusually stable mRNAs, an aspect that may facilitate their continued translation, as we observe a link between modification state and protein abundance.

Results

Identification of post-transcriptional RNA modifications in three poaceae

To determine the degree to which the epitranscriptome is conserved at a gene level, we performed poly-A enriched RNA-sequencing on soil grown seedlings of Zea mays, Sorghum bicolor, and Setaria italica (Maize, Sorghum, and Setaria, respectively, going forward) two weeks after germination. Paired-end, 150 base-pair RNA-sequencing was performed on aboveground tissue for two biological replicates (∼ 20 million reads per replicate; Fig. 1A). Following read mapping, modified sites were inferred using HAMR and ModTect (see Methods). Both of these algorithms take advantage of the propensity for certain reverse transcriptases to misinterpret ribonucleotides that are modified along the canonical base-pairing edge, and as a result, arrest, skip over the RCM, or incorporate non-reference nucleotides (i.e., SNPs) in the synthesized cDNA [51]. This misincorporation is non-random, and both HAMR and ModTect use the resulting population of SNPs to infer modification class. Importantly, both HAMR and ModTect disregard biallelic SNPs that could arise from natural variation, and instead focus on high coverage sites with all three non-reference nucleotides observed (HAMR) or incorporate information about both mismatches and deletions at the modified residue (ModTect). RNA-seq data were fed into both algorithms and sites that were predicted by both algorithms and both replicates were retained for subsequent analyses as high confidence RCMs. From this approach, 5,434-7,020 unique RCMs were identified in 1,944-2,542 transcripts in the three species (Fig. 1B, C, Additional File 1). These modifications represent the seven major classes detected by HAMR, with the m1A|m1I|ms2i6A class being the most common in all three species (approximately 33% of all identified RCMs; Fig. S1A). An examination of sequencing data across modified transcripts reveals a drop in coverage at the modification site, coinciding with a previously observed RT-arrest (Fig. S1B; [52]). Additionally, we find that mRNAs marked by RCMs are present at a substantially higher abundance relative to non-modified mRNAs (modified median TPM = 78.5, n = 2521 mRNAs, not modified median TPM = 10.2, n = 16,537, p < 2.2e-16, Fig. S1 E). Enriched gene ontology terms of modified RNAs in each Poaceae demonstrates significant over-representation in photosynthesis and cytoplasmic translation pathways (Fig. 1D). These data suggest that RCMs are targeted to conserved cohorts of photosynthesis and translation-associated RNAs in these grass seedlings.

Fig. 1
figure 1

RCM identification in three model Grasses. (A) Schematic of experimental design. Left and middle: Two biological replicates of grass seedlings from each species at a similar developmental timepoint were used to generate paired-end RNA-seq libraries. Right: Depiction of RCMs causing errors in reverse transcriptase nucleotide incorporation. Different modifications will result in a non-random pattern of SNPs or deletions at the modified site in the resulting cDNA. Machine Learning algorithms such as HAMR and ModTect can infer the modification class based on these errors. (B) Venn diagram showing the overlap of modified sites in the Sorghum transcriptome between biological replicates and two RCM detection algorithms. The union of all four categories (n = 7020) was kept for downstream analyses. (C) Summarizing the number of modified sites and number of modified transcripts in all three sampled Grasses. (D) Enriched GO terms of the modified RNAs in (C)

To determine if the conservation of RCM deposition goes beyond functional pathway and extends to the genes themselves, we next examined if RCMs are found on transcripts derived from orthologous genes using both sequence homology and synteny. Of the 1,912 possible modified orthologs (the smallest observed number of modified genes in Setaria), we observed 1,074 modified orthologs in one or both of the other species, and 474 orthologs that were modified in all three. This is significantly more orthologs than expected by chance, accounting for the number of annotated and expressed orthologs (p < 2.2e-16 multi-set hypergeometric test, Fig. S1C). We then assessed whether orthologous mRNAs are marked by a similar number of RCMs across species, allowing us to understand whether the RCM density of the targeted RNA is a conserved feature. Indeed, for all three combinations of comparisons between two sampled species, there is a significant positive relationship between the number of RCMs deposited on orthologous mRNAs (Sorghum: Maize r = .39, Sorghum: Setaria r = .43, Setaria: Maize: r = .32, p < 2.2e-16, Fig. S1D). These findings further demonstrate that the targeting and density of RCM deposition are conserved features in the sampled Grasses.

RCMs are dynamically deposited on mRNAs based on tissue and abiotic stress

Our previous analyses were limited to a single developmental time point across three species. Previous reports [14, 30, 53] suggest that RCMs may play important roles in post-transcriptional RNA regulation and thus would be dynamically deposited across development or environmental changes. Therefore, we chose to examine RCMs across diverse tissues and environmental contexts using RNA-seq from publicly available datasets [54,55,56]. We focused our efforts on Sorghum, first identifying RCMs in a large-scale tissue expression atlas by McCormick and co-authors [55] containing 137 sequencing samples across 48 tissues/stages/conditions (see Methods). From these data we identified 266,710 modifications on 6,805 unique transcripts, representing 19.3% of the expressed (TPM > = 1) Sorghum transcriptome. To determine whether the same repertoire of transcripts are being targeted with RCMs as our seedling data from Fig. 1, or if our data were biased towards identifying a distinct subset of RNAs, we compared the modified RNAs in each dataset. We found 1,813 of the 2,542 modified Sorghum seedling mRNAs are also modified in the McCormick et al., tissue expression atlas. This overlap is substantially more than expected by chance (p < 2.2e-16, Fig. S2 A) and likely reflects similarity in molecular processes within tissues examined.

We next aimed to characterize the tissue specificity of RCMs in the McCormick Sorghum tissue expression atlas. Using a modified calculation for Tau, a value typically used to calculate tissue specificity based on RNA abundance [57], we observed that most transcripts were modified in a very context-specific manner (Fig. S2B). The transcripts with the lowest Tau value, and therefore modified under the broadest context, were mostly associated with core cellular processes such as translation, whereas those transcripts modified in the most specific context were associated with mRNA maturation and abiotic stress responses (Fig. S2C, D). We also clustered transcripts based on the tissue type in which they were modified, the number of modifications identified, and the average number of modifications per transcript (Fig. 2A). In our analysis we observed a strong clustering based on tissue similarity, with seed and roots being notable exceptions. We find a strong bias for RCMs in this dataset towards root samples where both the number of RCMs and number of mRNA targets is substantially higher than other tissues (Fig. 2A and Fig. S2E). In contrast, seeds displayed the fewest number of modified transcripts, but the average number of modifications per transcript was very similar to that seen in leaf and root tissues where the number of modified sites and transcripts were much higher (Fig. 2A). We observed a large number of transcripts that are modified in all tissues (n = 851; Fig. 2B and Fig. S3 A), highlighting the existence of a core repertoire of RCM-targeted mRNAs. As expected, based on their presence across a broad tissue and developmental context, these mRNAs are enriched for terms associated with cytoplasmic mRNA translation (Fig. S3 B). However, the majority of transcripts were more restricted in terms of the tissue or developmental context under which they were modified and were enriched for more specialized GO terms. For instance, root-specific modified mRNAs were enriched for rhizosphere-associated terms [58] such as oxidation management, generic methylation, and aromatic compound biosynthesis (Fig. S3 C). Leaf-specific modified mRNAs were enriched for photosynthetic terms, whereas seed-specific mRNAs were enriched for lipid storage, ABA response, and cold acclimation terms (Fig. S3 D, E). Importantly, the observed increase in context-specific modifications was not simply due to differences in the most abundant transcripts in each tissue. Indeed, we observed a low (although positive, r = .34) correlation between RNA abundance and modification levels (Fig. 2C; top) in the Sorghum tissue atlas. Thus, although there is a core set of modified Sorghum transcripts, most are targeted for RCMs in a context- or developmentally-specific manner.

Fig. 2
figure 2

RCM dynamics across development and during stress. (A) Summary of RCM findings from re-analysis of Sorghum tissue expression atlas [55]. Hierarchical clustering using the number of RCMs per transcripts as input. For ease of viewing, branches were collapsed based on tissue and modification similarity. See Fig. S2E for un-collapsed tree. Tree tip labels denote the broad tissue category with the number after the tissue representing the number of SRAs used to summarize that broad tissue. Modified sites: Boxplots showing the total number of RCMs per SRA in each broad tissue category. Modified RNAs: The total number of modified RNAs (genes) per SRA in each tissue category. Modifications per RNA: The RCM density per RNA per SRA in each broad tissue category. (B) Upset plot quantifying the shared modified RNAs across the broad tissue categories in (A). (C) RCM-RNA abundance correlation distribution of all modified RNAs across the McCormick et al. tissue expression atlas [55] and the Varoquaux et al. [66] Sorghum drought experiment. The dashed red line shows the mean Pearson Correlation Coefficient of each experiment (r = .34 Tissue, r = .18 Drought). (D) Heatmap of RNA abundance and RCM changes in Varoquaux et al. re-analysis. Each row is a modified RNA that has been filtered for RNA abundance variability (see methods). Rows were clustered based on their abundance and RCM density values. The heatmap was grouped into four distinct clusters (C1-C4, n – values for each cluster shown) using k-means. SC = drought susceptible control, SD = drought susceptible treatment, TC = drought tolerant control, TD = drought tolerant treatment. At the time point analyzed, the drought susceptible genotype is RTx430 and the tolerant genotype is BTx642. (E) Enriched GO terms of the modified RNAs in each of the four clusters in (D)

The conservation of RCMs on orthologous genes in our grass seedling data suggest that the developmental and tissue contexts under which these marks are deposited might also be conserved. To address this contextual conservation, we examined RCMs in two publicly available Arabidopsis thaliana tissue atlases [54, 56]. Both atlases examined similar developmental stages, but did so under slightly different conditions (e.g., constant light and sterile MS media for Mergner et al. [54] vs. cycling light and soil for Klepikova et al. [56]). Importantly, while the Klepikova tissue atlas is primarily used by the community to examine transcript abundance in the Arabidopsis EFP browser [59], the work by Mergner et al. performed a paired assessment of transcript and protein abundance, which we used to examine the relationship between modifications and translation (see below). Due to differences in experimental design, we analyzed each of these atlases separately. A full breakdown of examined tissues, as well as total number of modifications identified, can be found in Additional File 2. Like Sorghum, clustering of Arabidopsis tissues by RCM density placed similar tissues together (Mergner et al.: Fig. S4A,B, Klepikova et al.: Fig. S5A, B). While Arabidopsis root tissues did not display a significantly elevated level of RCMs as in Sorghum, Arabidopsis seed transcripts had a reduced pool of very highly modified transcripts in both atlases. Similar to our observations in Sorghum, the seed-specific modified transcripts were enriched for lipid, nutrient, and ABA-response terms (Fig. S6A, B, C, D). Thus, as in Sorghum, the Arabidopsis epitranscriptome is diverse, highly context-specific, and appears to be associated with transcripts critical for cellular function.

Given the similar patterns of RCM abundance in Sorghum and Arabidopsis, we next examined whether orthologous mRNAs between Sorghum and Arabidopsis, which last shared a common ancestor ∼ 150–250 million years ago [60], are targeted by RCMs. While we identified fewer mRNA targets of RCMs in both Arabidopsis expression atlases relative to the Sorghum expression atlas (Mergner: 1,324, Klepikova: 2,495 vs. 6,845 in McCormick Sorghum), the number of modified orthologous mRNAs (658 Mergner vs. McCormick; 1,180 Klepikova vs. McCormick) was significantly more than expected by chance for all possible mRNA combinations (Fig. S6E, F, p  = .001 Hypergeometric test). We conclude that this class of RCMs target an ancient conserved repertoire of translation and photosynthetic related mRNAs.

Given the developmental differences in RCM deposition in both Sorghum and Arabidopsis, as well as reports associating RNA modifications with plant stress responses [9, 14, 61,62,63,64,65], we next determined if RCMs are associated with drought stress in Sorghum. We utilized a publicly available field-grown Sorghum transcriptome dataset that sampled well-watered and water-limited (i.e., drought treatment) Sorghum leaves and roots from drought tolerant and susceptible genotypes across weekly timepoints [66]. We focused on a post-flowering time point (week 10) where one genotype (BTx642) is considered drought tolerant and the other genotype (RTx430) is drought susceptible. Counterintuitively, we observed a shift towards a more negative correlation between RNA and RCM abundance during drought stress relative to the Sorghum tissue atlas (Fig. 2C, bottom). A heatmap comparing transcript and RCM abundance of the top 50% most variably expressed and modified root transcripts (n = 878) between the two genotypes and treatments clustered into four groups (Fig. 2D). One of these clusters (Cluster 3) showed similar increases in transcript abundance levels between the two genotypes under water limiting conditions but showed an increase in RCMs specifically in the drought tolerant genotype (Fig. 2D). An examination of enriched GO terms revealed that Cluster 3 contained both heat shock proteins as well as water transport proteins (Fig. 2E), suggesting that RCMs may be associated with the drought response in the tolerant genotype.

RCMs accumulate near exon-exon junctions and are associated with splicing events

Thus far we have observed an association between RCMs and plant developmental and environmental transcriptional responses. To gain functional insight into RCMs, we analyzed their accumulation and distribution across mRNA topologies using our seedling RNA-seq data for all three species. RCMs were enriched across the CDS, with a bias towards the 3’ CDS, and in the 3’ UTR of mRNAs (Fig. 3A). Like m6A [23], we also observed that RCMs are biased towards being deposited on abnormally long exons (median length of modified exons = 361 nts, median unmodified = 147 nts, median all exons = 148 nts, Fig. 3B, p < 2.2e-16). Additionally, we observed a significantly higher proportion of expressed transcripts that are multi-exonic being targeted by RCMs, relative to mono-exonic transcripts (∼ 11–14% vs ∼ 5–8%, Fig. 3C, p < 7.1e-15). This was also the case for long non-coding RNAs (lncRNAs) where 4.8% of mono-exonic lncRNAs are marked by RCMs (17/355) and 8.4% of multi-exonic lncRNAs are targeted by RCMs (59/704, p = .044; Fig. S7A). A closer examination of distinguishing CDS features uncovered a dramatic buildup of RCMs on both 5’ and 3’ edges of exon-exon junctions (EJs) relative to start and stop codons (Fig. 3D). Thus, these data suggest that RCMs likely play a role in the regulation of RNA splicing on diverse transcript types.

Fig. 3
figure 3

RCMs accumulate at exon-exon junctions and are associated with splicing. (A) Metagene plot showing the genic distribution of grass RCMs identified in Fig. 1. Ten nucleotide windows are plotted where the signal is the sum of RCMs in that window and plotted as density. (B) Density distribution of exon lengths for RCM marked exons vs exons not marked by RCMs. (Student’s t-test, p < 2.2e-16). (C) Proportion of expressed transcripts that receive RCMs in grass seedlings between mono-exonic and multi-exonic transcripts. Multi-exonic transcripts are significantly more likely to be modified than mono-exonic (Chi-squared test, all p-values < 7.1e-15). (D) Density of Sorghum seedling RCMs at start codons, stop codons, and 5’/3 exon-exon junctions. Density curves are plotted over a density histogram, both using a two-nucleotide window. The dotted line represents the first nucleotide of the start and stop codon, and the first nucleotide of the intron for 5’/3’ exon junction panels. (E) Histogram quantifying the number of Sorghum seedling RCMs accumulating at terminal exon-exon junctions vs the first exon-exon junction. (F) Box Plot and dot plot overlaid quantifying the proportion of RNAs that are marked by RCMs or not in each SRAs in the McCormick et al Sorghum tissue expression atlas. The x-axis splits the data by the number of isoforms a gene is expressing. RCM proportion increases as the number of isoforms expressed increases (p < 2.2e-16 one-way ANOVA)

This observation of RCMs preferentially occurring at EJs was initially made by Vandivier et al [14] on degrading mRNAs, whereas degrading mRNAs likely make up a small proportion of our dataset. Thus, these findings suggest a steady-state RCM enrichment at mRNA EJs. A transcriptome-wide 3’ bias was observed for these EJ-enriched RCMs (Fig. 3E). This terminal EJ enrichment was not due to 3’ sequencing bias that is often observed in poly-A enriched transcriptome datasets (Fig. S7B, C, D). Interestingly, genes that express multiple isoforms are more likely to be modified than single isoform transcripts with similar numbers of exons (Fig. 3F, p < 2.2e-16). However, we observe no significant difference in the buildup of modified sites at alternatively spliced junctions vs canonical splice sites (Fig. S8, see Methods). These data suggest that the increase in modification frequency at genes with more isoforms is likely due to the presence of more exon junctions. Thus, RCMs appear to be positively associated with splicing in plants.

RCMs are positively associated with stable and translating mRNAs

Because Mergner et al. [54] measured RNA and protein abundances from matched Arabidopsis samples they were able to correlate RNA: protein abundances across their samples. Therefore, we examined whether mRNAs marked with RCMs displayed a higher or lower RNA: protein correlation. A difference in RNA: protein correlation could suggest a RCM function in RNA stability and/or translation efficiency. mRNAs that are not marked with RCMs across the Mergner et al. atlas showed a lower median RNA: protein correlation (n = 3,361, r = .68) compared to RCM marked mRNAs which showed a significantly higher RNA: protein correlation (n = 332, r > = 0.758, p < .05, Fig. S9A). Due to this positive correlation between transcripts harboring base-pair disrupting modifications and their translation products, it is possible that these RCMs are positively influencing translation, either by reducing structural complexity or by stabilizing these transcripts.

To test for a relationship between RCMs and mRNA decay, we first examined publicly available transcriptomic data from Arabidopsis lines deficient in cytoplasmic mRNA decay pathways [67] by Sorenson and co-authors. Cytoplasmic mRNAs usually decay through three pathways: decapping (5’ -> 3’), the RNA exosome (3’ -> 5’), or an exosome independent 3’ -> 5’ decay pathway. Decapping occurs through a multi-subunit complex that is scaffolded by VARICOSE in plants and metazoans (VCS, [68]). Meanwhile, the exosome independent 3’ -> 5’ decay pathways occurs through SUPPRESSOR OF VARICOSE (SOV), which contributes to the decay of decapped RNAs and is also known as DIS3L2 in other Eukaryotes [69,70,71]. Sorenson et al took an RNA-seq approach to examine mRNA decay dynamics following transcriptional arrest (via cordycepin) in four Arabidopsis genotypes that vary in their cytoplasmic mRNA decay factors (wild-type, sov knockout, vcs knockout, and sov/vcs double knockout). If RCMs are primarily marking mRNAs for degradation (as initially suggested by Vandivier et al [14]), then a buildup should be observed after transcriptional arrest in genetic backgrounds deficient for mRNA decay.

The majority of modified transcripts at the beginning and end of the time series were mRNAs (Fig. 4A). As expected from arresting transcription, each genotype, with the exception of the sov/vcs double mutant, displayed a ∼ 25–50% decrease in the total number of observable protein-coding transcripts eight hours post-treatment (see Methods; Fig. S9B). The number of observed RCMs increased in all genotypes after arresting transcription, as did the proportion of modified mRNAs (Fig. 4B and Fig. S9C), suggesting two possibilities: 1) that modification abundance increases with transcript age, or 2) that non-modified transcripts are degraded more quickly leading to a higher proportion of transcripts detectable over time containing RCMs.

Fig. 4
figure 4

RCMs mark mRNAs that degrade slowly. (A) Reanalysis of Sorenson et al. 2018 Arabidopsis dataset [67]. Distribution of Arabidopsis gene types that receive RCMs at time point 0 and 8 h after arresting transcription. tRNA = transfer RNA, sn/snoRNA = small nuclear/small nucleolar RNA, rRNA = ribosomal RNA, mRNA = messenger RNA, lncRNA = long noncoding RNA. (B) Number of Arabidopsis RCMs predicted at each time point after arresting transcription and in each RNA degradation genotype. (C) Change in number of detectable RNAs over time after arresting transcription. Solid lines represent mRNAs that have a RCM at time point 0 while dashed lines represent mRNAs that have no RCMs at any time point. Expressed genes are those that have a normalized expression value > = 1 (see methods). (D) Comparing RNA abundance change after transcription arrest at each time point. Red boxplots (left grouping) are mRNAs with an RCM at time point 0, blue boxplots (right group) are mRNAs that have no RCMs at any time point. Comparisons at time points 7, 15, and 30, p > .05, all time points starting at 60 min and afterwards, p < .001 by one-way ANOVA. (E) Boxplot showing the distribution of ɑ-decay rates between RCM marked mRNAs at time point 0 and mRNAs that have no RCMs at any time point. (F) Modified figure from Sorenson et al. 2018 Fig. 1C showing the number of mRNAs that belong to different decay categories. Categories 1–14 are inconsequential for our conclusions. Group 15 represents genotype independent decay, (decay rates are not affected by the genotypes in the Sorenson et al. study). This category may represent targets of the exosome. * P-value between modified/never modified for Group 15 calculated by Chi-squared test

To more directly test the association of RCMs with RNA degradation, we analyzed whether the pool of transcripts that are modified at time point 0 were still detectable and modified at subsequent time points. Surprisingly, mRNAs modified at time point 0 were nearly all detectable 8 h after transcription arrest while mRNAs that were not modified at time point 0 declined by more than 40% (Fig. 4C). These results indicate that RCMs mark mRNAs which degrade slower than the entire mRNA pool. Indeed, mRNAs that are not targeted by RCMs show a significantly larger magnitude of TPM decrease relative to mRNAs marked with RCMs at time point 0 (Fig. 4D). These results strongly suggest that RCM marked mRNAs degrade at a slower rate relative to mRNAs that are not marked by RCMs.

Sorenson et al. also modeled the initial decay rate of all mRNAs (alpha decay rate) based on read abundance after transcription arrest. Based on these values, mRNAs that are modified at time point 0 in wild-type show a significantly lower (slower) alpha decay rate relative to all other mRNAs (Fig. 4E). Sorenson et al. used the decay rates across genotypes to assign mRNAs to genotype-dependent RNA degradation pathways (Fig. 4F x-axis groups 1–14, see Fig. 1C from Sorenson et al.). For mRNAs that could not be assigned to the VARICOSE or SOV degradation pathways, Sorenson et al. hypothesized that they were likely targets of the RNA exosome (x-axis group 15; Fig. 4F). Interestingly, modified mRNAs are significantly biased towards being assigned to this group and thus are likely targets of the RNA exosome (Fig. 4F p < 2.2e-16, Pearson’s Chi-squared test). While the importance of RCMs on rRNAs and tRNA stability and function has been known for decades, based on these data, we would argue that the HAMR/ModTect class of RCMs appear to be marks of Pol-II transcript stability, rather than marks of degradation.

RCMs are not associated with nonsense-mediated mRNA decay

To further investigate whether RCM marked mRNAs have an association with RNA degradation pathways, we next turned our focus to the Nonsense Mediated mRNA Decay (NMD) pathway. NMD is responsible for degrading aberrant mRNAs that have a premature termination codon. This is often recognized as a termination codon upstream of the exon junction complex (EJC, [72]). Given the accumulation of RCMs at exon-exon junctions and the recent report of RNA degradation intermediates accumulating near exon-exon junctions [73], we tested whether modified mRNAs were likely targets of NMD. SUPPRESSOR OF MORPHOLOGICAL DEFECTS ON GENITALIA7 (SMG7) is a critical component of early NMD signaling in most Eukaryotes [74]. Gloggnitzer and co-authors [75] performed RNA-seq in Arabidopsis with a loss-of-function smg7 mutant in a genetic background avoiding the strong autoimmune response of knocking out NMD (pad4; [75]). We re-processed the RNA-seq data generated by Gloggnitzer et al. and identified RCMs from their data. The up-regulated mRNAs in the smg7 genotype represent both direct and indirect targets of NMD silencing. We observed strong statistical evidence that NMD targets are actually under-represented (that is, depleted) from mRNAs containing RCMs (5,275 mRNAs containing RCMs, 656 up-regulated mRNAs in smg7, overlap = 103, p < 2.2e-16, Fig. S10). Furthermore, we identify no significant differences in RCM distribution at exon-exon junctions between pad4 and smg7-pad4 RCMs (Fig. S10B, p = .986), or between smg7 RCMs in mRNAs up-regulated in smg7 vs. those not differentially abundant in smg7 (Fig. S10C, p = .144). In agreement with the smg7 results which examined a single tissue, there is no significant overlap between predicted NMD targets and the RCMs identified in the Mergner et al. tissue atlas (Fig. S10D, p = .304). Collectively, these results suggest RCMs are not associated with the NMD pathway.

Discussion

In this study we used a bioinformatic workflow consisting of HAMR and ModTect [14, 15, 44] to predict RCMs in diverse species across the flowering plant lineage in order to clarify RCM distribution and putative function on mRNAs. Of note, because these algorithms rely on a certain minimum depth of expression in order to statistically call a transcript as being modified, and we tended towards a conservative definition of when to call an RCM (present at the same site in bio-replicates based on both algorithms), making conclusions about low abundance transcripts can be problematic. Despite these caveats, we observed a substantial number of transcripts whose modification status changes along developmental or environmental gradients. In perhaps the most extensive comparative analysis to date, we demonstrated that the RCMs detected by HAMR and ModTect (e.g., those that fall along the WCF base-pairing edge) are found on a large, yet contextually discrete set of transcripts. From these diverse analyses we believe that a number of conclusions can be made that provide lessons as to the function of RCMs in eukaryotes.

RCMs are not found on all expressed transcripts, nor are they always on the most abundant transcripts, suggesting a contextual specificity. Indeed, there appear to be two classes of transcripts that receive RCMs in plants. There are a core set of transcripts associated with photosynthesis and cellular metabolism that are modified in each tested tissue or developmental context. These transcripts tend to be abundantly modified and are the targets of all of the major base-pair disrupting modifications. Due to their critical functions, these genes are conserved, as is their modification status, across the plant lineage. The other class of modified transcripts receive RCMs in a more context-specific manner. These transcripts may be expressed in multiple tissues but are generally targeted by modifications in a more limited subset of conditions. Functional enrichment suggests that these transcripts are modified in response to environmental or developmental cues.

We observed similar patterns of RCM deposition on all transcripts, with a bias towards longer exons and transcripts containing multiple exons. In contrast to m6A, these RCMs are enriched at exon splice junctions. Interestingly, while we see an increase in RCM abundance on transcripts with multiple isoforms, we see no association between RCM status and alternative splicing or intron retention, suggesting that modifications are more of a general marker of splicing rather than a driver of isoform selection. We would propose that these modifications are deposited on the base-pairing edge of the newly transcribed mRNA to reduce structure at exon junctions in order to facilitate splicing, whereas m6A is occluded from these sites due to its interference with binding of the splicing machinery [33, 34]. This model would further suggest that, like m6A, these other RCMs are deposited co-transcriptionally and thus their writers should appear in complexes with transcriptional and epigenetic machinery. In support of this hypothesis, the m5C methyltransferases NSUN3 and DNMT2 are known to interact with hnRNPK for Pol-II recruitment and modulating chromatin state in response to stress [76].

Previous work in Arabidopsis comparing RCM abundance in degrading and total mRNA populations found evidence for RCMs as a marker for mRNA degradation [14] whereas others have suggested that loss of RCMs leads to unstable mRNAs [10]. In an attempt to resolve these conflicting models, we turned to an elegant time-series RNA-seq experiment in Arabidopsis examining the impacts of mutants in the mRNA decay pathway on RNA turnover [67]. We used these data to assess the relationship between degradation and RCM status. To our surprise, we found that RCMs are more likely to mark extremely stable mRNAs, rather than those that are rapidly degrading. Our finding that most of the RCMs fall on transcripts associated with core cellular processes fits with the notion that these transcripts would also be more stable, a finding reported by Sorenson et al. and others [77]. Interestingly, we also observe a positive relationship between RCM abundance and protein abundance in the Arabidopsis tissue atlas, further suggesting that RCMs, as a whole, have a positive influence on mRNA stability. Importantly, we cannot definitively say that degrading RNAs are not marked, just that the RCMs we can monitor are enriched on mRNAs with long half-lives. A key difference between our work and that of Vandivier et al. [14] is in the depth of sequencing, both for the grass seedling experiments that we performed and for the additional datasets we reprocessed, and so further examination of the connection between these base-pair disrupting RCMs and RNA degradation is warranted.

Conclusions

In this work, we used a comparative transcriptomic approach to better understand the conservation and function of a poorly studied class of RNA modifications. Grass seedlings contained similar cohorts of modified transcripts, a finding that spurred us to look in other datasets that incorporated additional stress or developmental timepoints in both Sorghum and Arabidopsis. Again, we observed a conserved, core set of modified transcripts, but these expanded datapoints allowed us to uncover context specificity to the epitranscriptome, suggesting that some, but not all, transcripts are modified at key steps in their functional lifecycle. Attempts to better understand function of these base pair disrupting modifications revealed that they are likely important for splicing, translation, and intriguingly, for RNA stability. Similarities across the Angiosperm lineage, along with supporting literature in metazoans, suggests that these modifications, like m6A, are a deeply conserved aspect of RNA biology for which we still know very little.

Methods

Plant growth

Seed from Zea mays (acc. B73), Sorghum bicolor (acc. BTx642), and Setaria italica (acc. Yugu1) were sown onto damp soil approximately 2 cm below the surface, and stratified for one week in the dark at 4 °C. Seeds were then transferred to a growth cabinet with lights configured to a long-day cycle (16 h light, 8 h dark). All species’ seeds germinated within a day of each other and were allowed to grow for two weeks once germinated. Seedling shoot tissue was collected and instantly frozen with liquid nitrogen.

RNA-sequencing library preparation

RNA from two biological replicates in each species was isolated using TriZol per the manufacturer’s instructions (ThermoFisher #15596026). Paired-end and strand specific sequencing libraries were generated for each species using the YourSeq FT Strand-Specific mRNA library prep kit (Active Motif #23001). Libraries were sequenced on a Novogene Novaseq platform.

Read processing and modification identification

Raw sequencing reads were trimmed for adapters and low-quality bases using fastp (v 0.23.4, [78]) with default settings for paired end reads. Trimmed reads were then used to call HAMR modifications using the PAMLINC workflow (https://github.com/chosenobih/pamlinc). Briefly, this workflow maps reads to each species’ reference genome (obtained from Ensembl Plants in January 2022) using BowTie2 (v 2.2.5, [79]) and retains unique alignments that don’t overlap exon-junctions or deletions. The HAMR statistical model is then run on these filtered alignment files (BAM files) to predict RCMs.

In addition, a similar algorithm, ModTect (v 1.7.5, [15]), was used to call RCMs. For ModTect prediction, trimmed reads were mapped to the same Ensembl Plants genomes and associated annotation files as above using STAR (v 2.7.10b, [80]). The following non-default STAR arguments were used for mapping: --outFilterIntronMotifs RemoveNoncanonical, --alignIntronMax 10000, --outSAMstrandField intronMotif. Raw BAM outputs were then used as input for ModTect with the following arguments: –minBaseQual 30, --readlength 150, and --regionFile genes.bed which is a four-column file of all gene regions in the annotation file.

For sequencing data generated in this study, RCM sites were retained that overlapped both ModTect and HAMR predictions among both sequencing replicates (Fig. 1B). For public sequencing data, RCMs were identified using only ModTect and RCMs were retained that were common among three or more experiments (i.e., three or more distinct SRAs).

RNA abundance measurement

Trimmed fastq files were used to measure RNA abundance using Salmon (v 1.10.0, [81]) in “quant” mode against a decoy aware transcriptome index. The tximport (v 1.28, [82]) R package was then used to import Salmon quant files and generate gene or transcript level transcript per million values, specifying the following tximport option: countsFromAbundance = “lengthScaledTPM”. RNAs used for downstream RCM and expression analyses were those with at least one transcript per million in one experiment. For the reanalysis of the Varoquaux et al., dataset in Fig. 2D, root tissue was chosen for comparison. Transcripts whose RNA abundance did not vary in the root control vs. drought treatments were removed by filtering out the bottom 50% of transcripts based on their abundance variability measured by median absolute deviation. The frequency of RCM density for every transcript (number of RCMs in a condition / total number of RCMs across all root samples) was computed and then transcripts without RCMs in any dataset was removed. For the reanalysis of Sorenson et al. 2018, we used RPKM (reads per kilobase per million mapped) values that the authors provide in their supplemental data which accounts for stable transcripts containing an increasing proportion of sequencing reads.

Alternative splicing

Two approaches were used to infer alternative splicing. First, genes that express multiple isoforms can be inferred from Salmon [81] using the tximport [82] R package as described above. Specifying the tximport “txOut” argument gives isoform level quantifications. Isoforms expressed above one TPM were retained. Genes expressing more than one isoform were inferred to be undergoing alternative splicing. Second, we examined statistically significant changes in splicing patterns between experimental conditions using EventPointer [83].

GO term enrichment

GO terms were downloaded for each species from Ensembl Plants. GO term enrichment was performed using the clusterProfiler (v 4.8.2, [84]) R package, specifically the “enricher” function was used, specifying: (1) the input gene set, (2) a p-value cutoff of 0.05, (3) Benjamini and Hochberg multiple testing correction, (4) a background gene set (usually, all expressed genes (TPM > = 1)), (5) a minimum and maximum gene-set size of 10 and 500, respectively, and (6) a q-value cutoff of 0.05. Fold enrichment was calculated using the following functions: parse_ratio(GeneRatio) / parse_ratio(BgRatio).

Data analyses

All analyses and statistical testing was performed using the R programming language (v 4.2.0, [85]). Tests for over or underrepresentation were performed using the phyper function (i.e., a hypergeometric test) in R. Correlations between RNA abundance and RCMs were performed by collecting the RNA abundance values and RCM density values (number of RCMs per transcript) from each sequencing experiment of interest. Pearson Correlation Coefficients were then calculated from the RNA abundance and RCM density using the “cor” function in R. To measure statistical significance of mean-separation between groups, the Student’s t-test was used for two group comparisons, while one-way Anova was used for more-than-two group comparisons.

All visualizations were generated in R from various packages including: ggplot2 (v 3.4.4, [86]), ggvenn (v 0.1.10, [87]), enrichPlot (v 1.20.0, [88]), complexUpset (v 1.3.3, [89]), complexHeatmap (v 2.16., [90]), and the ggtree package (v 3.8.2, [91])