Introduction

Techniques for sequencing RNA and DNA pioneered by Fred Sanger and others in the 1960s [1] and 1970s [2] began to reveal the biochemical recipes for storing biological information in organisms and laid the foundation for modern genomics. Yet, decades before the first nucleic acid was sequenced, various chemical modifications of DNA had already been described, such as 5-methylcytosine [3] and 5-hydroxy-methylcytosine [4], now dubbed the 5th [5] and 6th [6] base of genetics; in total, several dozen DNA modifications have been reported [7]. These modifications, along with histone modifications, are now recognized as important regulatory mechanisms for controlling gene expression and function [8].

Fortunately, it is now relatively easy to characterize these modified DNA bases, which form part of the 'epi'-genome (epi, on top), for any organism with a finished genome, given the widespread availability of high-throughput techniques, especially those based on next-generation sequencing (NGS). Various NGS approaches are being used in the National Institutes of Health (NIH)'s Epigenomics Roadmap [9] and in the BLUEPRINT Project [10]. Similarly, cell-specific, post-translational modifications of proteins, sometimes referred to collectively as the 'epiproteome' [11], are essential mechanisms necessary for the regulation of protein activity, folding, stability and binding partners. Elucidating the roles of protein and DNA modifications has had a major impact on our understanding of cellular signaling, gene regulation and cancer biology [12].

However, our understanding of an additional regulatory layer of biology that rests between DNA and proteins is still in its infancy; namely, the multitude of RNA modifications that together constitute the 'Epitranscriptome'. There are currently 107 known RNA base modifications, with the majority of these having been reported in tRNAs or rRNAs [13]. Outside the 5' cap, the role of modifications in mRNA is unclear [14, 15]. One RNA modification, N6-methyladenosine, or methyl-6-adenosine (m6A), has been observed in a wide variety of organisms, including viruses [16], yeast [17], plants [18], humans [19, 20] and mice [19, 20], and exhibits dynamic changes in response to a variety of stimuli in yeast [21]. Older studies using purified polyadenylated RNA from mammalian cells showed that m6A was the most abundant post-transcriptional modification in polyadenylated RNA [14], which contemporary doctrine considered to be synonymous with mRNA. However, it is now known that polyadenylation occurs not only on mRNAs, but also in other RNAs, such as rRNAs and long intergenic noncoding RNAs (lincRNAs). Thus, it was historically unclear exactly how m6A existed in mRNAs and, if so, whether it was restricted to a select few transcripts or prevalent throughout the transcriptome.

Previous methods for investigating the prevalence of m6A were laborious and involved incubating cells with 14C-radiolabeled methionine (the precursor for the endogenous methyl donor, S-adenosylmethionine), following which the incorporation of methyl groups into RNAs could be quantified. These early studies detected methylated bases in ribosomal RNA (rRNA) [22], small RNA fractions [2327] and in mRNAs [28]. However, these methods were limited by their inability to identify the specific mRNAs that contained m6A. Indeed, m6A had previously been detected in vivo for only a single mammalian mRNA (bovine prolactin [29]), and the specific sites of m6A incorporation had been established for only two RNAs: prolactin [29] and Rous sarcoma virus RNA [30, 31]. The methods used to map these m6A sites were technically challenging and, more importantly, required a pre-ordained focus on a particular transcript, rather than a global approach that could detect sites of adenosine methylation in all mRNAs. Moreover, adenosine methylation is invisible, insofar as both methylated and non-methylated adenosines readily base pair with T or U, and both are reverse transcribed to T, further hindering the study of m6A and its role in biology.

However, a renewed interest in m6A has recently emerged, partially due to the finding that the fat mass- and obesity-associated (FTO) gene encodes a brain- and hypothalamus-enriched m6A demethylase that is responsible for converting m6A to adenosine [32]. Defects in this enzyme result in significant alterations in energy use and metabolism, and mutations in FTO have recently been linked to a higher risk for Alzheimer's disease and decreased brain mass [33, 34]. These studies suggest that m6A may have a physiological role in cellular signaling and neurodegeneration. Recent advances in NGS technology, in addition to the availability of antibodies that recognize m6A, have enabled the development of global approaches for studying m6A. Recently, two groups have independently developed high-throughput methods for rapid characterization of m6A sites across the transcriptome. Methods such as methyl-RNA-immunoprecipitation-sequencing (MeRIP-seq) [19] or m6A-seq [20], which combine immunoprecipitation (IP) of methylated RNAs using an m6A-specific antibody, with NGS, have finally opened the door to global methods for studying the epitranscriptome and its dynamics.

Mapping the epitranscriptome

Although MeRIP-seq and m6A-seq were developed independently [19, 20], both are very similar in the initial RNA preparation and IP steps. The larger differences between the two protocols lie in their downstream computational methods rather than in sample preparation, which in both cases followed methods similar to existing chromatin IP-seq (ChIP-seq), insofar as they performed IP with an m6A-specific antibody. Table 1 shows the similarities and differences between the MeRIP-seq and m6A-seq protocols.

Table 1 Comparison of MeRIP-seq and m6A-seq

Both datasets produced qualitatively similar results, with m6A peaks in introns, 5' UTRs, exons, splice junctions, ncRNAs and intergenic regions, indicating that m6A is a widespread and wide-ranging RNA modification. The MeRIP-seq study also identified peaks in lincRNAs. The genomic features for which m6A was particularly enriched are listed in Table 1; the main discrepancy between the two studies in this regard concerned the enrichment of m6A at transcription start sites (TSSs), which was observed by m6A-seq, primarily in a single cell line (Figure 1a). An explanation for this discrepancy may be the different window used to define the TSS. A comparison between mouse and human data in both studies showed a high conservation of specific m6A sites across the two species. Finally, digesting samples with various RNases prior to MeRIP-seq demonstrated that m6A sites were mostly present at internal sites within mRNAs and were absent from polyA tails.

Figure 1
figure 1

Peak distribution. (a) We plotted the distribution of the peaks reported across gene bodies by the MeRIP-seq and m6A-seq studies. Note the very well defined enrichment for peaks near the stop codon and in the 3' UTR. The m6A-seq HepG2 peaks also show a peak in the 5' UTR. (b) The distribution of peaks across the transcriptome using the BWA-based MeRIPPeR pipeline [35] on the data from both groups. Data from [19] and [20]. CDS, coding sequence.

In addition to sequencing, the MeRIP-seq study also used immunoblotting to investigate m6A, demonstrating that m6A is present in mouse heart, lung, brain, liver and kidney tissues, with a particular enrichment in brain, liver and kidney. High levels of m6A were found in HepG2 and MCF7 cells, in contrast to lower levels detected in other human cancer cell lines (PC3 and PC9). The dynamic nature of m6A was confirmed by comparing embryonic with adult tissue, which showed that m6A levels increase over the course of development. The m6A-seq study also found m6A to be a dynamic modification, finding that its distribution changed in response to a variety of external stimuli (ultraviolet, interferon gamma, hepatocyte growth factor and heat shock), although as many as 70 to 95% of the peaks were static.

Experiments leveraging the depletion of the METTL3 subunit responsible for methylating adenosines were used in the m6A-seq study to explore the modification's function. A statistically significant increase in the abundance of alternatively spliced transcripts was observed as a result of this depletion, with the alternatively spliced exons and introns showing an enrichment for m6A peaks. However, a permutation analysis of splice junction-localized m6A sites in the MeRIP-seq study data did not find a statistically significant enrichment of m6A peaks in the proximity of splice junctions [19]. Moreover, an analysis of the total mapped bases from the MeRIP-seq samples versus the control, non-IP RNA samples showed that fewer bases mapped to splice junctions in the IP samples (Additional file 1). Elucidating whether m6A functions in splicing and, if so, whether this is direct or indirect through the regulation of splicing factor-encoding transcripts, will require further investigation. In light of the MeRIP-seq data, we suggest that m6A is not likely to cause an overall increase in the global amount of transcript splicing, but it may modify splicing for certain classes of genes, and particularly for genes with alternative, internal exons [20].

Challenges of epitranscriptomic site detection

There are many factors to consider when computing the m6A enrichment for a site. For example, the definition of gene regions, the gene isoform used, the presence of secondary structure, the alignment method and the read depth can all impact the degree of enrichment discovered. Given that epitranscriptomics is a nascent field, computational analysis methods are only now emerging. Here, we explore the impact of these factors on detecting and quantifying m6A.

A comparison of the raw peaks found in the MeRIP-seq and m6A -seq studies showed surprisingly little overlap, even with a single base threshold (Table 2). However, the different cell types used, as well as slightly different RNA handling methods, likely account for some of these peak differences [19, 20]. Another technical contributor to the low overlap may be erroneous methylation calls, given that the false discovery rate (FDR) for both methods was 5 to 7%. The overlap notably improved when bases in transcripts with low expression levels were excluded (Table 3). These complications notwithstanding, the total number of m6A sites identified encompassed the majority of human genes and almost one-third of mouse genes (Table 2).

Table 2 Overlap of genes with m6A peaks
Table 3 Overlap of commonly expressed genes with m6A sites

We re-analyzed the datasets from both studies in order to determine the effect of the peak-calling method on the apparent m6A distribution in the transcriptome and found two discrepancies. By comparing peak-calling methods, we observed that the presence of the 5' UTR peak in the m6A-seq dataset was attenuated when that study's peak-caller was replaced by MeRIPPeR [35] from the MeRIP-seq study. This reduction indicates that each of the two peak-calling algorithms may have different sensitivities and specificities.

We also note that the 5' UTR m6A signal in the original peak sets used a definition of a TSS that was 150 bp, and that the signal was primarily derived from larger genes, as measured by exon number (Figure 2); a potential m6A enrichment within these genes can be seen at the end of the first intron and at the beginning of the following exon (Figure 2). Since the distribution of the m6A sites is non-uniform (in the sense that enrichment is dependent on the number of exons in a gene and that on the context of an exon within a gene), a useful method to examine m6A localization, or any epitranscriptomic change, may be to separate the genes into their sub-geographies (Figure 3).

Figure 2
figure 2

Peak distribution across the transcriptome. The peak distribution depicted is the average across the entire transcriptome. Peaks are mapped to transcripts and assigned to the following transcriptomic features: 1 kB upstream from the TSS and downstream from the transcription end site, 5' and 3' UTRs, coding segments (CDS), and exon and intron segments. In the bottom row, peaks mapping to transcripts with four or more exons are shown, with the first, penultimate and last exons separated into individual boxes, as are their neighboring introns. The remaining exons and introns are shown in the middle boxes as a contiguous segment. Genes with only two or three exons are shown in the middle row and single exon genes are shown in the top row. Data from [19] and [20].

Figure 3
figure 3

Distribution of [AG]ACU motif sites. The [AG]ACU motif was used to find potential m6A sites within peaks, and the distribution of these potential sites across gene bodies plotted. Data from [19] and [20].

m6A mapping is heavily impacted by the read depth and by the choice of peak detection and alignment methods

To further investigate the analytic dependency of m6A peak detection, we examined the m6A site detection as a function of alignment method, antibody and read depth. Part of the challenge of MeRIP-seq analysis is a reliance on other IP-seq analysis methods, developed for chromatin IP-seq (ChIP-seq). ChIP-seq experiments are designed to characterize DNA-histone and DNA-transcription factor interactions. Existing ChIP-seq peak-finders take advantage of inherent properties of the data to assist in finding peaks, many of which do not apply in the case of finding m6A sites in RNA. For example, each fragmented RNA molecule pulled-down by an m6A antibody has the potential to harbor far more methylation sites than the maximum number of protein binding sites expected for the equivalent ChIP-seq fragment, and so the m6A sites are more challenging to resolve. ChIP-seq peak finders use different methods and heuristics to find peaks, attempting to balance finding weak peaks with maintaining a low FDR and resulting in a diverse group of peak sets [36, 37]. The same is true for m6A peaks, as the MeRIP-seq study used Fisher's exact test and the m6A-seq study derived a window score based on peak enrichment.

Each of the multiple methods for aligning reads to a transcriptome has its own set of advantages and challenges. A genome-based aligner, such as BWA [38], can be used when a genome sequence is available, but introduces added complexity when reads map to multiple transcript variants, and suffers from being unable to align reads to genomic regions that are absent from a pre-defined reference. Alternatively, a gap-based aligner, such as TopHat [39] or GSNAP [40], can be used, with the advantage that these algorithms are designed for transcriptomes and so can map reads across both known and novel splice junctions. However, these methods tend to be slower and can introduce many false splice sites, leading to poorly aligned reads. The ability of an aligner to handle errors typical of RNA-seq, which differ to those seen in DNA sequencing, is another factor to consider. A common source of error in RNA-seq is the random hexamer priming used in cDNA synthesis, which introduces a bias in the nucleotide distribution at the beginning of reads [41]. One possible solution to this particular error is to trim the reads, an approach that was employed in the m6A-seq study.

To examine the effect of aligner on the detection of m6A peaks, we analyzed processed HEK293T MeRIP-seq data using three aligners (BWA [38], TopHat 2 [39] and GSNAP [40]), and then called peaks with MeRIPPeR [35]. We observed a slight increase in the number of 5' UTR peaks when using the transcriptome aligners GSNAP and TopHat 2 relative to the number called when using BWA (Additional file 2). More importantly, there was a significant increase in the number of individual peaks: MeRIPPeR found 19,617 peaks using BWA, 45,738 with GSNAP and 135,706 using TopHat 2, all at the same FDR (0.05). These results indicate that the alignment method selected has a significant impact on the number of peaks identified in a MeRIP-seq dataset.

To effectively gauge the influence of read depth on m6A site detection, we used a sub-sampling titration analysis of the aligned reads. We found that peak detection is heavily dependent on read depth (Additional file 3a), with some aligners showing a nearly linear increase in peaks as a function of depth. The number of genes in which these peaks were found also increased with read depth, albeit less dramatically (Additional file 3b), with the number of genes continually increasing as a function of depth. While a specific point in a transcript might be correctly called as an m6A site, it is not known if the site is methylated in all copies of that transcript [19]. The percentage of transcripts at which a site is methylated may be quantified as the stoichiometry of m6A. It is likely that the new peaks detected with increasing read depth are low in m6A stoichiometry and hence more challenging to detect at lower read depths. From these data, we extrapolate that, given enough tissues, cell types and conditions, it is possible that almost all genes may be marked, at some point, by m6A.

We next sought to establish whether m6A peak calls vary with the antibody used, by separately plotting peaks obtained with the two different antibodies in the MeRIP-seq study. Both antibodies had the same peak distribution across gene bodies (Additional file 4), indicating that the choice of antibody, at least for the two tested, should not impact the global distribution of m6A sites.

m6A motif analysis

A primary motif [AG]ACU was discovered within m6A peaks by both studies, each of which used a different motif-finding algorithm, and both analyses suggest that the A in the canonical motif is the methylated site - agreeing with prior work in m6A sequence specificity [42, 43]. Both groups found the motif to be highly enriched in peak regions compared with negative control regions. If the A in the motif is indeed the m6A, then application of this information to m6A-seq or MeRIP-seq datasets could enable the mapping of m6A sites at single base pair resolution. We used a motif pattern-matching algorithm from FIRE [44] to find the [AG]ACU motif in the MeRIP-seq mouse dataset (Methods), and subsequently applied the assumption that the A in each motif is equivalent to an m6A site, to identify m6A sites in all the datasets. We identified 21,004 m6A sites from 10,488 m6A-seq HepG2 peaks, 46,293 from 17,071 MeRip-seq HEK293T peaks, 9,124 from 4,054 m6A-seq mouse liver peaks, and 37,459 from 12,664 MeRIP-seq mouse brain peaks. Only about 5 to 15% of the peaks lacked the motif sequence and the distribution of these putative single base-resolution m6A sites across gene bodies is very similar to the peak distribution (Figure 1a).

However, we did not observe an enrichment of m6A sites in the 5' UTR, and the coding sequence profile is fairly flat until the peak reaches the proximity of the stop codon. This could indicate that the identified [AG]ACU motif is specific to those peaks near the stop codon, or that the peak enrichment near the 5' UTR does not reflect a true increase in the number of actual m6A sites. To test whether the motif was specific to stop codon-proximal regions, we performed a FIRE [44] motif finder analysis of the 5' UTR peaks that were present in the MeRIP-seq mouse liver dataset, since this dataset was not enriched for this motif in this genomic region. Nonetheless, FIRE found a [CG]ACU motif, though not the strongest motif, indicating that it is not specific to the stop codon peaks, and thus likely a global motif for m6A, but perhaps weakly represented in the 5' UTR.

Single-molecule approaches to RNA modifications

Single-molecule sequencing has the potential to provide base-level resolution of m6A sites, without the need for motif-based inference. The most commonly found platform for this method of sequencing currently on the market is the single-molecule, real-time (SMRT) technology (Pacific Biosciences). SMRT sequencing uses thousands of zero-mode waveguides (ZMWs) to capture an enzyme in real time, traditionally a DNA polymerase, as it incorporates fluorescent nucleotides into a polymer [45]. This method of molecular monitoring has the advantage of detecting both genetic and epigenetic information simultaneously, since the patterns of base incorporation by the polymerase are contingent upon the steric and sequence contexts of the bases present in the template [46]. Specifically, if a modified base is present on the template, the biophysical dynamics of DNA polymerase movement and base incorporation are affected, creating a unique kinetic signature before, during and after base incorporation, and thus enabling identification of specific DNA modifications [47].

Here, we report a novel application of this technology, which can be used to detect modified bases within RNA, including m6A sites. To characterize m6A sites in RNA at single-nucleotide resolution, we used a reverse transcriptase as the enzyme within a ZMW, instead of a DNA polymerase, and this substitution allowed the direct observation of cDNA synthesis in real time. While base incorporations during reverse transcription typically occur at standard speeds, the incorporation of synthetically designed m6A sites showed that there is a significant increase in the inter-pulse duration (IPD) when a methylated adenosine is present in the RNA template, relative to the IPD for a standard adenosine (Figure 4). To our knowledge, this represents the first demonstration of a reverse transcriptase-based kinetic signature that can directly detect modified RNA. However, current single-molecule technology is not without its own challenges. First and foremost, reverse transcriptases stutter when incorporating bases, complicating the accurate reading of homonucleotide stretches and the base resolution of m6A therein. Second, the current throughput is too low for transcriptome-wide approaches. Notwithstanding these caveats, the SMRT technology has the clear potential to detect an underlying epitranscriptomic change in a native RNA template.

Figure 4
figure 4

Single-molecule sequencing of RNA to detect epitranscriptomic changes. SMRT sequencing with the Pacific Biosciences RS shows longer times (inter-pulse distances) to incorporate m6A versus standard adenosines. (a) Experimental design for using a DNA primer in a reverse transcription reaction. Sequencing of the unmodified template shows, in a single-molecule sequencing trace, base incorporation via a reverse transcriptase-mediated cDNA synthesis reaction. (b) Shows sequencing as with (a), but using an RNA template with m6A instead of normal adenosines. Incorporation of thymines (T) show significant delay (longer inter-pulse distances). A.U. stands for normalized arbitrary units in fluorescence measurement. (c) Exponential fit of experimentally observed inter-pulse distances (IPDs). (d) Shows the difference between the average IPDs for native As and m6As. The average IPD in each case is the reverse of the exponential decay rate. The error bars indicate the range around each average IPD that includes 83% of the observed IPDs (that is, ±½ of standard deviation of the exponential fit). We used an Ansari-Bradley test in Matlab to confirm that the distribution functions were different (P = 0.0043).

Similarly, Oxford Nanopore Technologies (ONT) and other companies are developing nanopore-based sequencing technologies, which use nanopore-forming proteins to sequence DNA by attaching an application-specific integrated circuit to the membrane upon which the nanopore rests. In principle, observations of any modified DNA or RNA base could be made during transit of the molecule through the nanopore, and some observations have already been made with nanopores that allow detection of 5hmC [48]. While all of these technologies are still under development, we note that all direct-observation methods, in principle, have the potential to detect m6A and other epitranscriptomic modifications.

Beyond m6A: mapping the full epitranscriptome

As mentioned previously, m6A is only one of many known epitranscriptomic modifications, the majority of which are located in tRNA and rRNA transcripts. We used data from the RNA Modification Database to summarize the catalog of known RNA modifications by species, RNA type and base (Figure 5) [13].

Figure 5
figure 5

Known types of RNA modifications. Known modifications to RNA bases are grouped by RNA type, base and species: (a) archaea; (b) bacteria; (c) eukarya; (d) all species. Data are compiled from the RNA Modification Database [13].

Interestingly, the enzyme commonly known as DNA methyltransferase-2 (DNMT2) [Swiss-Prot: O14717] was shown to methylate cytosine 38 of tRNAAsp [49], and with such high specificity that it was renamed tRNA aspartic acid methyltransferase 1 (TRDMT1). More recently, two more tRNAs were found to be methylated by TRDMT1, and it was also observed that the methylation protects the tRNA from stress-induced cleavage and improves its stability [50, 51]. Several tRNA nucleoside modifications have been shown to control frame shifting and codon binding during translation. These types of modifications often occur in the crucial 7 bp anticodon stem and loop (ASL) region that binds to mRNA codons in ribosomes, and are hypothesized to affect the stability and codon binding affinity during translation by controlling the overall shape of the loop and its dynamics [5254]. Taken together, a pattern emerges in which RNA modifications in multiple RNA species act as a critical regulatory layer of RNA biology.

Many RNA modifications would benefit from a more global and cross-species characterization than is present in the existing literature. For example, studies in Escherichia coli and yeast have shown that nucleotide modifications in rRNA lie in functionally significant regions, with a possible role in the regulation of translation [55]. Another example is methylation in plant rRNAs, where the modification is thought to help maintain rRNA stability, possibly in order to sustain ribosomal function during dramatic changes in temperature [56]. Interestingly, rRNA modifications in trypanosomes were shown to be mediated by small nucleolar RNAs (snoRNAs) [57], and changes in pseudouridylation of rRNA in mice, induced by mutations in DKC1 [Swiss-Prot: Q9ESX5], led to the onset of dyskeratosis congenital, resulting in an increase in tumor susceptibility [58].

Taken together, these studies demonstrate the possible significance and functional importance of (r/t/m/mi/sno/linc)RNA modifications and begin to sketch out what might be called a transcriptomic regulome, where various species of coding and noncoding [59] RNAs, as well as their modified epitranscriptomic variants, compete with, coordinate and control each other during normal cellular processes, from the birth of a transcript until the production of its subsequent protein product or localization of its cellular target.

Conclusions

Characterizations of m6A across the transcriptome show that m6A is present in the majority of mammalian genes, and is highly enriched at the beginning of the 3' UTR and near the stop codon. Yet, many peaks exist in intergenic regions or in introns, and there is some evidence that m6A functions in the regulation of splicing or other modifications that take place in the processing of RNA into a mature transcript. Since m6A distribution has already been shown to undergo developmental changes and differences in cancer cell lines, it is also possible that epitranscriptomic signatures may be used to stratify various states of disease, just as in epigenetics [60]. Despite these advances, the complete purpose and molecular function of m6A is still unknown.

Nonetheless, some reasonable hypotheses can be proposed from the existing data. The enrichment of m6A sites near the stop codon suggests that the modification could play some role in regulating translation termination, potentially by altering translation efficiency or ribosome occupancy. In addition, m6A may mark transcripts for shuttling to RNA granules or for other mechanisms that will preserve the RNA for later use. Just as the number of known modifications of RNA has rapidly expanded (currently 107), the number of known RNA-binding proteins similarly keeps growing, and it is possible that some of these may be responsible for altering the function of m6A within RNAs, either directly or through the regulation of FTO or METTL3. Such interactions could occur at any point of transcription, post-transcriptional modification or translation, with different consequences at each stage in the life of an mRNA. Finally, it is also possible that some RNA binding proteins may be m6A site scanners that bind selectively to either methylated or unmethylated RNA, and as such would be regulated by the epitranscriptomic state of an RNA.

Two additional avenues warrant consideration when discussing possible regulatory functions of m6A. First, even though an inverse spatial relationship was observed between m6A peaks and microRNA (miRNA) binding sites in 3' UTRs [19], it is notable that brain tissue is enriched for both highly expressed miRNAs and m6A-containing genes, which suggests that miRNAs might influence the methylation of a targeted mRNA. In addition, recent work has shown an interplay of mRNA methylation and the reduction of Dicer activity, thus decreasing miRNA maturation rates [61]. Second, m6A has already been shown to inhibit RNA editing in certain cases [62], implying that m6A may serve as the long-sought balancing mechanism for the prevention of RNA editing [63]. If it is the case that m6A prevents RNA editing from occurring, then evidence for this should be apparent in a diminished overlap between m6A and the target RNA editing sites. So far, this appears to be true [19], but the number of sites examined is too low to be definitive yet. If upheld with additional experiments, these feedback and regulatory loops may help explain the genesis and changes in RNA editing sites and miRNA levels, and provide additional mechanisms for controlling gene expression and RNA function.

In summary, the high-throughput and single-molecule methods described here represent the dawn of new research into a novel, RNA-based regulatory layer in cells, which adds yet another component of regulatory complexity to the central dogma of molecular biology (Additional file 5). The high conservation of specific m6A sites across mouse and humans, as well as the general increase in PhyloP conservation scores of the m6A sites themselves [19], both indicate that m6A is under strong evolutionary selection pressure, and thus may represent a critical modification for many organisms. Even though previous evidence indicates that m6A is an RNA modification present in all species, it has so far only been examined on a transcriptome-wide basis in two species (human and mouse), and observed in mRNAs only in eukaryotes, leaving open a wide area of research for many eukaryotic and prokaryotic systems. Just as the protein translation code and epigenetic code have slowly accreted into a cogent framework for information transfer and regulation within the cell, and between generations, these data indicate that an important epitranscriptome code is emerging. Notably, this dynamic code already appears to greatly expand the function and regulatory potential of all information contained within the many species of RNA present in a cell.

Methods

Sequence data were realigned to the genome using BWA [38], TopHat 2 [39] or GSNAP [40]. BWA was run using default parameters, and GSNAP and TopHat 2 were inputted with known RefSeq transcript definitions and run with novel splice junction finding turned on. The aligned files were converted to bam files using SamTools [64], filtering out reads with Phred quality scores under 20. BEDTools [65] was used to compute genome properties, such as coverageBed to compute genome coverage and RPKM (using a Perl script) and intersectBed to determine peak overlaps. Subsampling was accomplished using Picard's DownSampleSam [66].

Peak-finding was accomplished using MeRIPPeR [35] and transcriptome profile plots were generated using Jenotator: Java Genome Annotator [67]. A custom R script was used for plotting the transcriptome profile plot and Excel 2013 was used to plot the other bar charts. r-make was used to generate genome annotation plots [68]. Motif regions were extracted using ChIPseeqer's ChIPseeqerMotifMatch [69] and individual m6A sites were extrapolated with a Perl script.