Background

Adenosine-to-inosine (A-to-I) RNA editing is one of the most common RNA modification types in human cells, which greatly diversifies the transcriptome [1]. A-to-I RNA editing is catalyzed by enzymes encoded by the adenosine deaminase acting on RNA (ADAR) gene family in Metazoans [2,3,4,5]. ADAR proteins recognize and bind to double-stranded RNAs (dsRNAs) to deaminate adenosines into inosines [6,7,8,9]. Most RNA editing sites in human cells occur in Alu repeats [10,11,12,13], the most abundant type of short interspersed elements (SINEs).

A-to-I RNA editing occurs in both coding and non-coding regions, with diverse functional roles in human cells [1, 14, 15]. The impact of recoding sites (i.e., those that alter protein-coding sequences) has been relatively well studied, many of which alter protein function [16]. It is now known that RNA editing in non-coding regions can influence gene expression, such as by affecting alternative splicing [17,18,19] or RNA stability [20,21,22]. In addition, RNA editing affects microRNA maturation, leading to the crosstalk between RNA editing and RNA interference [23, 24]. Recently, regulation of the immunogenicity of dsRNAs is emerging as an important aspect of RNA editing function [25,26,27]. Given the diverse functions of RNA editing, abnormal editing patterns have been reported for numerous diseases, such as neurological diseases, autoimmune disorders, and cancers [1, 14, 28,29,30].

Next-generation sequencing technologies, especially RNA-sequencing (RNA-seq), have greatly facilitated the discovery of RNA editing events [31,32,33,34]. To date, more than 16 million RNA editing events have been cataloged in human transcriptomes [35]. In order to segregate RNA editing sites from single-nucleotide polymorphisms (SNPs) in the genome, many previous methods required sequencing of both DNA and RNA of a sample. In our previous work, we developed a method, namely GIREMI, to accurately identify RNA editing events using a single short-read RNA-seq dataset without genome sequencing data of the corresponding sample [36].

With the development of third-generation sequencing (TGS) technologies, long-read RNA-seq methods recently emerged as powerful tools to study RNA biology. Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the two main representatives of the TGS platforms. Different from short-read RNA-seq methods, long-read RNA-seq interrogates full-length transcripts without breaking the RNAs into small fragments, thus preserving transcript structures [37]. As a result, long-read RNA-seq overcomes the transcript assembly ambiguities inherent to short-read RNA-seq, greatly improving the understanding of transcriptome diversity [38].

A number of methods have been developed to analyze long-read RNA-seq data, primarily focusing on transcript isoform identification and their abundance analysis [38,39,40,41]. Another application, which is significantly under-explored, is to identify and analyze single-nucleotide variants (SNVs) in the RNA. Identification of SNVs, such as genetic mutations or RNA editing sites, is fundamental to many biomedical questions. In long-read RNA-seq, SNV analysis presents significant challenges, due to the well-known high error rates of the third-generation sequencers.

We present L-GIREMI (long-read GIREMI), a method to identify RNA editing sites in long-read RNA-seq (without the need of genome information). L-GIREMI effectively handles sequencing errors and biases in the reads and uses a model-based approach to score RNA editing sites. L-GIREMI allows investigation of RNA editing patterns of single RNA molecules, co-occurrence of multiple RNA editing events, and detection of allele-specific RNA editing. This method provides new opportunities to study RNA nucleotide variants in long-read RNA-seq.

Results

Overview of the L-GIREMI method

Linkage patterns between alternative alleles of RNA variants in the mRNA differ for different types of RNA variants. For example, a pair of SNPs within the mRNA are generally expected to possess perfect allelic linkage. In contrast, non-genetic RNA variants, such as RNA editing sites, do not generally show significant allelic linkage with each other nor with SNPs (unless allele-specific editing exists). In our previous work, we showed that these properties can be employed to distinguish RNA editing sites from genetic variants using short-read RNA-seq data [36]. Long-read RNA-seq affords a major advantage in capturing such allelic linkage since multiple variants in the same mRNA can be covered by each read. Leveraging this feature of long-read RNA-seq, we developed the L-GIREMI method to identify RNA editing events using this data type.

After a typical read mapping procedure (e.g., via minimap2 [42]), the L-GIREMI algorithm is mainly composed of four steps (Fig. 1). First, the strand of each read was examined and corrected if necessary (see Methods). Second, mismatch sites in the BAM file were obtained and pre-filtered according to common practices in detecting RNA editing sites using RNA-seq data [43, 44]. In the third step, the mutual information (MI) between pairs of mismatch sites in the same gene was calculated. Specifically, an average MI was calculated for each unknown mismatch relative to putative SNPs (from dbSNP) covered by the same reads. Similarly, MI of pairs of putative heterozygous SNPs (from dbSNP) was also obtained. Since most RNA editing events are expected to occur independently of the allelic origin of the mRNA, the above MI for an RNA editing site (relative to SNPs) should be smaller than that between two heterozygous SNPs. Thus, the two types of MI values were compared to predict RNA editing sites among the unknown mismatches. The predicted RNA editing sites then served as training data for the fourth step, where a generalized linear model (GLM) was derived. Sequence features and allelic ratios of candidate sites were included as predictive variables in the GLM and a score was calculated for each mismatch (Methods).

Fig. 1
figure 1

The schematics of the L-GIREMI algorithm (see text for details)

Performance evaluation of L-GIREMI

We first tested the performance of L-GIREMI using a dataset derived from the brain sample of an Alzheimer’s disease (AD) patient (PacBio Sequel II, data available at PacBio, 4,277,293 reads). As expected [37], the majority of reads harbored at least one mismatch or indel (Additional file 1: Fig. S1a). On average, 14 mismatches, 38 deletions, and 11 insertions were found in each read (Additional file 1: Fig. S1b). Thus, the nature of long-read RNA-seq presents substantial challenges in resolving bona fide nucleotide variants.

L-GIREMI overcame these challenges and effectively detected RNA editing sites from the dataset. As shown in Fig. 2a, upon the initial screen of nucleotide variants, all 12 types of single nucleotide mismatches were detected in the mapped reads, with the A-to-G type (likely due to A-to-I editing) constituting only a small fraction. The L-GIREMI built-in filters (Methods) were applied to remove sites possibly arising from sequencing errors (Step 2, Fig. 1), which improved the %A-to-G among all mismatches (Additional file 1: Fig. S2a, Additional file 2: Table S1). Subsequently, MI values were calculated for mismatch sites that shared at least 6 reads with putative heterozygous SNPs (defined as dbSNPs with allelic ratio between 0.35 and 0.65 in the data). As a comparison, the MI values of pairs of putative heterozygous SNPs were also calculated. As shown in Fig. 2b, the MI distribution of unknown mismatches relative to SNPs was well separated from that of pairs of putative heterozygous SNPs. Thus, using the MI distribution of SNPs, we calculated an empirical p value for each mismatch site and identified those with p < 0.05 as candidate RNA editing sites. This step detected a total of 13,442 editing sites, with 83.3% of them being the A-to-G type (Additional file 2: Table S1). These sites were in turn used as training data for the GLM model (Methods). The contribution of different features used in the GLM is shown in Additional file 1: Fig. S2b. In total, 28,584 RNA editing sites were detected in the AD dataset, with 98.1% being A-to-G mismatches (Fig. 2c). The high fraction of A-to-G sites among all predicted editing sites attests to the high accuracy of our method. Interestingly, we observed that the %A-to-G sites among all predicted sites increased monotonically with the GLM score (Fig. 2d). By default, L-GIREMI chooses a score cutoff (vertical line in Fig. 2d) for each dataset to optimize the F1 value (Methods). As an alternative approach, a user-defined GLM score cutoff can be provided to achieve a desired %A-to-G, based on the GLM score vs. %A-to-G relationship.

Fig. 2
figure 2

Identification of RNA editing sites in long-read RNA-seq data of the brain sample of an Alzheimer’s disease (AD) patient. a Raw mismatches detected in the dataset. b Mutual information for pairs of putative heterozygous SNPs (based on dbSNP) or non-dbSNP mismatches relative to putative SNPs. c RNA editing sites identified by L-GIREMI. d %A-to-G among all predicted editing sites vs. GLM score. Dotted line denotes the score cutoff used for c (0.64). e Number of RNA editing sites identified given different read coverages (randomly chosen subsets of the AD dataset). f %A-to-G among the RNA editing sites identified in the subsets in e

In short-read RNA-seq analysis, a standard method to identify RNA editing sites is called the “genome-aware” method, where sample-specific genomic variations were used to segregate RNA editing sites from genomic SNPs. To further evaluate the performance of L-GIREMI, we analyzed the AD dataset using a “pseudo-genome-aware” approach. That is, after the same pre-filtering step as used in L-GIREMI, we predicted RNA editing sites by excluding all known human SNPs from dbSNP (since sample-specific genomic data is not available). We examined the %A-to-G among all predicted sites by both methods and separately for different types of regions (Additional file 2: Table S1). In addition, we included the % of predicted sites that are cataloged in the REDIportal database (%REDIportal) [35]. L-GIREMI outperformed the pseudo-genome-aware method in all types of regions evaluated by both %A-to-G and %REDIportal.

To evaluate the impact of read coverage on the results, we randomly sub-sampled the AD data set to retain different numbers of total reads. As expected, the number of predicted RNA editing sites decreased given reduced read coverage (Fig. 2e). In contrast, the “sensitivity” of the method remained high at low read coverages as evaluated against identifiable REDIportal sites (Additional file 1: Fig. S2c). In addition, the fraction of A-to-G sites among the predicted RNA editing sites (score cutoff determined by F1 score) remained relatively high except for the very low coverage (e.g., 0.4 M reads, Fig. 2f). It should be noted that the fraction of A-to-G among the pre-filtered sites (step 2, Fig. 1) was much lower than that of the final L-GIREMI predictions (Additional file 1: Fig. S2d), supporting that the MI and GLM steps enhanced the prediction accuracy. In summary, L-GIREMI affords high accuracy in capturing RNA editing sites in long reads for a wide range of total read coverages.

Identification of RNA editing sites with data of different PacBio platforms

We analyzed three datasets derived from GM12878 cells by the ENCODE project (ENCODE IDs: ENCFF417VHJ, ENCFF450VAU and ENCFF694DIE, 1,673,768 reads, 2,137,168 reads and 2,538,701 reads respectively) via the PacBio Sequel II platform. The 3 datasets had similar error profiles. However, ENCFF417VHJ, which was built using the MaximaH- reverse transcriptase, showed relatively lower levels of indels and mismatches than the other two datasets (Additional file 1: Fig. S3a, b).

The GLM score vs. %A-to-G curves were largely monotonic, although exceptions existed (Additional file 1: Fig. S3c). The final predicted RNA editing sites for the ENCFF417VHJ dataset had a high level of A-to-G (99.9%), which was much higher than the 31.4% and 37.8% for ENCFF450VAU and ENCFF694DIE respectively (Additional file 1: Fig. S3d) (all of which being higher than the % using only pre-filters, Additional file 1: Fig. S3e). Therefore, the chemistry used in generating PacBio RNA-seq libraries can greatly influence RNA editing identification. Note that for the latter two datasets, a user-defined GLM score cutoff could be used to achieve a higher accuracy (%A-to-G) in the predicted editing sites. Importantly, we noted that similar data derived from GM12878 cells but using the earlier Sequel platform yielded lower quality and suboptimal RNA editing identification (Additional file 1: Fig. S4). Thus, our data suggest Sequel II is a preferred platform for the purpose of RNA editing studies.

In the following sections, we only used the ENCFF417VHJ dataset given its improved quality. Compared to the genome-aware method (i.e., by filtering out GM12878 SNPs cataloged by the Genome-in-a-Bottle project [45]), L-GIREMI yielded higher %A-to-G and %REDIportal in different types of regions (Additional file 2: Table S2). Note that due to the relatively low total coverage of ENCFF417VHJ, the number of predicted editing sites was not high, especially in non-Alu regions. Although the genome-aware method identified many sites in non-Alu regions, a considerable fraction of them may be false positives as reflected by the low %A-to-G and %REDIportal.

Comparison of RNA editing sites identified in short and long reads

Since short-read RNA-seq has been frequently used in RNA editing analyses, we compared the results identified in short- and long-read RNA-seq of GM12878. PolyA-selected cytosolic RNA-seq data generated by the ENCODE project (ID: ENCSR000COR (ENCLB555ANM)) was used for this purpose. The data was randomly down-sampled to retain a total of 30 million short reads, which is approximately equivalent to the base coverage afforded by the long-read data (ENCFF417VHJ), considering read length differences. We used the same “genome-aware” method for short reads as in our previous work [36]. As shown in Additional file 1: Fig. S5a, this method identified more than 4000 editing sites (86% being A-to-G) in the short reads. Compared to the long-read results (Additional file 2: Table S2), short-read RNA-seq appeared to yield higher sensitivity. However, the %A-to-G based on short-read data was relatively low, especially in non-Alu regions (which is partly due to the relatively low total read coverage, known to affect %A-to-G as discussed in our previous work [43]).

We observed that a larger fraction of editing sites from short reads was located in introns, compared to that of long reads (Additional file 1: Fig. S5b). This may also explain the higher number of editing sites detected in short reads as it is known that introns are enriched with Alu elements where the majority of editing sites reside. To conduct further comparisons, we obtained the union of mismatch sites that were testable (with ≥ 6 total reads) in both datasets. A total of 190 editing sites were identified in both, with 339 unique to the long reads and 148 unique to the short reads (Additional file 1: Fig. S5c). Sites unique to one dataset were absent in the other dataset mostly due to lack of edited reads. The %A-to-G of the subsets of sites were high, reflecting high quality predictions. Thus, although less editing sites were identified in the long reads, the two types of modality performed similarly at common testable sites.

Co-occurrence of RNA editing sites

Consistent with previous reports based on short-read RNA-seq data, the majority of RNA editing sites detected by L-GIREMI in the GM12878 (ENCFF417VHJ) and AD datasets were located in non-coding regions and Alu elements (Fig. 3a, b) [31, 33, 37, 46, 47]. A prevailing question is whether multiple editing sites of a gene tend to co-occur in a subset of RNA molecules or if their occurrence is largely independent of each other. This question was challenging to address using short reads [48,49,50]. In contrast, the long-read RNA-seq data enable a direct examination of this question.

Fig. 3
figure 3

Co-occurrence of A-to-I RNA editing sites in Alu elements detected by L-GIREMI. a Genomic context of A-to-I RNA editing sites identified in the AD sample or GM12878 cells. b Number of RNA editing sites in Alu repeats or otherwise. c Cumulative distribution for the Gini index of Alu repeats calculated via read-specific editing ratio. Shuffled data were generated for comparison (Methods), which led to significantly lower Gini index values than the original data (p < 0.001 for both data sets, KS test). d Cumulative distribution of mutual information of pairs of editing sites or pairs of SNPs in the same gene. Compared to the shuffled controls, both editing sites and SNPs show higher levels of linkage (p < 0.001 for all comparisons, KS test) although the latter were associated with much higher mutual information

We first asked whether editing sites in the same Alu was approximately equally distributed among the reads corresponding to the Alu. To this end, we calculated the Gini index for the number of editing sites observed in each read of an Alu. Gini index is a measure of inequality among a given set of values (Methods). As background controls, we shuffled the occurrence of the observed editing sites among all reads of the Alu. As shown in Fig. 3c, the Gini index of the actual editing profiles was much larger than that of the controls, suggesting the existence of co-occurrence of editing sites in the same Alu.

As an alternative method, we examined the MI of pairs of editing sites in a gene and asked whether their values were larger than that of randomly shuffled editing sites. Consistent with the above finding (Fig. 3c), the MI values of pairs of editing sites in the same gene were significantly higher than those of the shuffled controls (Fig. 3d). Nonetheless, it should be noted that the MI of editing sites was substantially lower than those between pairs of SNPs (Fig. 3d), which has been established previously [31, 33, 37, 46, 47]. This observation still holds if only known editing sites (from REDIportal) were used (Additional file 1: Fig. S6). Thus, it is unlikely due to the fact that MI was used in the process of editing site identification. Overall, our results support the existence of co-occurrence of RNA editing sites in the same RNA molecules, to a level significantly higher than random expectations, but significantly lower than genetic linkage.

Allele-specific RNA editing events detected by L-GIREMI

Long-read RNA-seq allows examination of linkage patterns between any type of RNA variants in the same read. One type of linkage event, allele-specific RNA editing, reflects the existence of genetic determinants of RNA editing, which has been shown in human and mouse tissues [51]. However, it is not clear whether allele-specific RNA editing affects a predominant number of editing sites. We examined this question using the long-read RNA-seq of GM12878 since its whole-genome sequencing data is available. Specifically, we calculated the MI values of all known RNA editing sites in the REDIportal database [35] relative to known SNPs in GM12878 that were detectable in the long-read RNA-seq data. Note that the REDIportal sites were used here instead of editing sites identified in this study so that the source of the editing sites was independent of any linkage calculation. As shown in Fig. 4a, the majority of these REDIportal-defined known editing sites had relatively low MI values, with only a small fraction (~ 12%) having MI values greater than 0.3 (the cutoff used in L-GIREMI to identify RNA editing sites in this dataset).

Fig. 4
figure 4

Allele-specific editing reflected in the GM12878 long-read RNA-seq data. a Mutual information of known editing sites (REDIportal). Six sites with high mutual information were randomly chosen for experimental testing (arrows). b Summary of experimental validation results. Four out of 6 sites were confirmed as RNA editing sites. cf Sanger sequencing traces for 4 confirmed editing sites. d IGV plot for an example allele-specific editing event: between the RNA editing site chr1:9102175 (red rectangle) and three heterozygous SNPs (blue rectangle). Note that the gene is on the − strand

The above results suggest that allele-specific editing may only affect a minority of editing sites. To exclude the possibility that apparent allele-specific RNA editing may be largely due to the existence of genetic variants among REDIportal-defined editing sites (i.e., false positives), we tested 6 likely allele-specific editing sites (arrows, Fig. 4a) using Sanger sequencing. Four of these sites were confirmed as RNA editing sites, whereas two of them were neither edited nor SNPs (thus likely sequencing errors) (Fig. 4b–f). Figure 4g shows the reads harboring an example allele-specific editing event. Therefore, our data suggest that allele-specific RNA editing does exist, although relatively rare.

Since L-GIREMI uses MI as its initial step to predict RNA editing sites, this step excludes allele-specific editing sites. However, such sites may still be captured in the scoring step of L-GIREMI where the GLM model is used for prediction (Additional file 1: Fig. S7). In general, L-GIREMI is not recommended for detecting allele-specific editing for novel editing sites. Nonetheless, the MI calculation implemented in L-GIREMI can be used to uncover allele-specific editing of known RNA editing sites.

dsRNA structures likely affect long-read coverage

While inspecting RNA editing sites in the RNA-seq reads, we observed a curious pattern where some long reads showed skipping of a region, often in the vicinity of RNA editing sites. Such skipped regions did not coincide with annotated splicing events. Previous studies observed similar patterns in expression sequence tags [52]. Figure 5a shows an example in the 3′ UTR of the gene MREG. In this example, two Alu repeats are located in the skipped region, where many editing sites were identified. This region folds into a strong dsRNA structure (Fig. 5a).

Fig. 5
figure 5

Long-read RNA-seq detected highly structured regions. a IGV plot for an example where many reads had internal skipping. This region harbors two Alus. The double-strand RNA structure predicted by RNAfold is shown. b Diagram illustrating RT-induced template switching that may induce region-skipping in long reads. c Read coverage of predicted dsRNAs with or without region-skipping patterns. d Histogram of dsRNAs with different fractions of reads that showed region-skipping patterns. e The length of skipped regions within predicted dsRNAs (median = 813, 627 for the AD and GM12878 data, respectively)

We hypothesize that region-skipping in long reads is a consequence of the highly structured nature of the RNA. Indeed, it is known that reverse transcriptase (RT) can generate deletion artifacts in cDNAs, which is caused by intramolecular template switching, an event where RT skips the hairpin structure of the template RNA (illustrated in Fig. 5b) [53]. To further explore this hypothesis, we followed a previously published method to identify dsRNA structures harboring editing-enriched regions [54]. A total of 36,166 and 17,293 predicted dsRNAs were covered by at least 1 read for the AD and GM12878 datasets respectively (Additional file 1: Fig. S8). Among these predicted dsRNAs, about 20% overlapped reads with region-skipping (Fig. 5c). In addition, 98.2% and 97.6% of the skipped regions were located between two Alu repeats for the AD and GM12878 datasets, respectively (Fig. 5c). For 34.4% and 31.6% of the dsRNAs in AD and GM12878 datasets, respectively, the skipping pattern occurred in ≥ 50% of their reads (Fig. 5d). The median length of the skipped region was about 600 to 800 basepairs (bp) (Fig. 5e), which is approximately the length of two adjacent Alu repeats [11].

Discussion

Long-read RNA-seq is a powerful means for transcriptome profiling [38, 40, 41, 55, 56]. A number of methods have been developed to discover and quantify full-length isoforms in long-read RNA-seq data [38, 40, 41, 55, 56]. Specialized methods were also developed to capture the unique signatures of inosines in direct RNA-sequencing of Oxford Nanopore long reads [57]. However, analysis of single-nucleotide variants, such as RNA editing sites, in cDNA long-read RNA-seq presents significant challenges resulting from the relatively high level of sequencing errors. To this end, we present L-GIREMI, a method to identify RNA editing sites using long-read RNA-seq data.

L-GIREMI examines the linkage patterns between sequence variants in the same reads, complemented by a model-driven approach, to predict RNA editing sites. We adopted a similar strategy as in our previous method, GIREMI [36], which focused on short-read RNA-seq data. We showed that L-GIREMI affords high accuracy as reflected by the high fraction of A-to-G sites or known REDIportal sites in its predictions. It considerably outperformed the traditional “genome-aware” methods. Although the “genome-aware” methods are quite effective for short reads RNA-seq data, their performance deteriorates in the presence of sequencing errors in the long reads. In addition, we demonstrated that the performance of L-GIREMI is robust given a wide range of total read coverage. Furthermore, as expected, RNA editing identification depends on the quality of the long-read data, with Sequel II-derived data outperforming those of the Sequel platform.

Long-read data is naturally advantageous in capturing correlative occurrence of multiple nucleotide variants in the mRNA. Compared to short reads where only a limited number of mismatch pairs may be captured in the same reads, long reads can theoretically cover all mismatch pairs in the mRNA. Thus, the MI method applied to long-read RNA-seq is expected to be much more effective than in the case of short reads. However, a challenge still exists with long reads due to the relatively high sequencing error rate. As a result, sequencing errors may lead to false positives in the MI-predicted editing sites. This limitation may explain the relatively low %A-to-G (70–80%) among the editing sites predicted by MI alone (Additional file 2: Table S1, S2), which is in stark contrast to the very high %A-to-G (> 99%) among the MI-predicted editing sites in the short reads [36]. Therefore, for the current long-read datasets, it is necessary to combine MI with the GLM method to boost the performance of L-GIREMI. Since the MI-predicted editing sites were used as training data for the GLM step, inaccuracy in the training data may also affect the performance of the GLM, which is reflected in the relatively small variance explained by some features (neighboring nucleotides or allelic ratio, Additional file 1: Fig. S2b). It should be noted that the mismatch type feature in the GLM explained the most variance (Additional file 1: Fig. S2b). This feature was learned from the MI-predicted editing sites, thus not allowing any user input to bias the mismatch composition in the final predicted editing sites. As a result, L-GIREMI is applicable to datasets where the dominant type of editing is unknown a priori. With future improvements in sequencing error rates and sequencing depth of long reads, we expect the MI method alone will become increasingly sufficient and the GLM may not be necessary. Thus, we provide an option in L-GIREMI to run only the MI step for future applications.

Compared to short reads, long-read RNA-seq yielded less editing sites, especially those in intronic regions. However, this observation may not be universally applicable as various types of library preparation protocols for RNA-seq may generate considerably different distributions of reads in different types of genomic regions. At present, at least for RNA editing analysis, long-read RNA-seq is not yet a strong replacement of short-read RNA-seq, given its relatively high cost and high error rate. A combination of the two modalities will undoubtedly yield enriched information, by not only identifying more editing sites but also providing ways for concurrent analysis of nucleotide and isoform variants.

Long-read RNA-seq allows examination of co-occurrence of RNA editing sites in a single molecule. Leveraging this strength, we showed that editing sites in the same Alu or mRNA co-occurred more often than expected by chance. This observation extends previous reports using short-read RNA-seq that detected clustered RNA editing sites in hyper-edited regions [58]. Many scenarios may lead to co-occurrence of RNA editing sites, for instance, long lifespan of RNA molecules, higher local concentration of ADAR proteins, or synergistic effects [59]. Nonetheless, it is important to note that this level of co-occurrence is much lower than that driven by the linkage patterns of genetic variants on the same haplotype. Thus, the basic rationale of L-GIREMI that distinct MI distributions exist for genetic variants and RNA editing sites still holds.

Similar to RNA editing co-occurrence, long-read RNA-seq also allowed us to evaluate the possible existence of allele-specific editing. Our data supported the infrequent existence of this phenomenon. Allele-specific editing may be caused by SNP-related structural changes in the dsRNA region, demonstrating a mechanism of cis-regulation of RNA editing. The calculation of MI as implemented in L-GIREMI can be used to detect allele-specific editing events.

Another notable observation of our study is that the dsRNA structure may lead to the skipping of a sizable region in the long-read RNA-seq. Previous studies also reported such observations, likely due to reverse transcriptase template switching, thus not unique to PacBio data [38, 53, 60]. This artifact may induce false positive calls of alternative splicing events, if the region-skipping is interpreted as a splicing event. It may also reduce the sensitivity in identifying RNA editing sites that reside in the skipped regions. On the other hand, compared to short-read data, long reads are uniquely advantageous in detecting this phenomenon, which may be leveraged to inform RNA secondary structure predictions in the future. Lastly, we note that RNA editing is an effective proxy to the existence of dsRNA structures. An alternative way to predict dsRNA regions is to search for those that harbor inverted repeat Alu (IRAlu) pairs. Region-skipping was rare in IRAlu regions with low RNA editing levels (Additional file 1: Fig. S9), presumably due to lack of dsRNA structures.

Conclusions

In summary, we present a method for the identification of RNA editing sites in long-read RNA-seq with high accuracy, even given low sequencing depth. Application of L-GIREMI allowed examination of RNA editing sites in single molecules, allele-specific RNA editing, and region-skipping due to existence of dsRNA structures. This method provides a powerful means in examining nucleotide variants in long reads.

Methods

Mapping of reads using minimap2

Minimap2 was used for the mapping of long-read RNA-seq data against the human genome (hg38) [42]. The “--cs” option was included in order to output the cs tags that enabled parsing of sequence variants. Only unique mapped reads were retained for the identification of RNA editing sites. Samtools was used to remove multi-mapped reads with the option “-F 2052” in the samtools view module [61]. Subsequently, the filtered SAM files were sorted according to genomic coordinates and converted into BAM files.

The L-GIREMI analysis steps

The L-GIREMI algorithm consists of four main steps: (1) correction of read orientation, (2) collection of mismatches, (3) calculation of mutual information among mismatch sites, and (4) scoring of mismatches with a generalized linear model (GLM). The details of each step are provided below.

Correction of read orientation

Although most reads generated by PacBio had the correct read strand, a minority of reads had wrong orientations, which can affect the accuracy of nucleotide variant analysis. We adopted the following strategy to check and, if necessary, correct the read strand. First, we obtained 3 types of strand information: the strand of the read from minimap2 (namely, mapped strand), the strand of the gene annotated (Gencode v34) at the read location (namely, annotated strand), and the strand from which the majority of the splice site sequences of the read were consistent with the known motifs (GT-AG, GC-AG or AT-AC) (namely, splice site strand). For most reads, all 3 sources gave the same strand. Otherwise, the read strand was set to be the dominant one among the three. Furthermore, in rare cases, only two types of strand information were available due to the read mapping to intergenic regions or the lack of known splicing motif. For these cases, the annotated strand was given the highest priority, followed by splice site strand and mapped strand, respectively. If the read was mapped to a region with sense and antisense gene pairs, the gene that had a larger overlap with the read sequence was used. It should be noted that reads harboring no spliced junctions were removed from the analysis to avoid contamination by DNA-derived reads.

Collection of mismatches

In this step, we obtain a catalog of all mismatches in the mapped reads, detected via the cs tags generated by minimap2. Their genomic coordinates, reference nucleotides, and alternative nucleotides were stored. In order to remove likely sequencing errors, several filters were implemented according to common practice in RNA editing identification via RNA-seq [44]. Specifically, the following mismatch sites were removed: (1) those with low read coverage (< 3 reads), (2) sites within simple repeats or homopolymers, (3) sites within the vicinity (default 4 bp) of splice sites, and (4) sites with allelic ratio (minor allele/total) less than a threshold (default 0.1). Mismatched sites overlapping dbSNP annotations (v38) were labeled as SNPs and those with allelic ratio between a range (0.35 to 0.65 by default) were labeled as heterozygous SNPs, which were used in the calculation of mutual information.

Calculation of mutual information among mismatch sites

L-GIREMI calculates the MI for each mismatch site relative to heterozygous SNPs, and for pairs of heterozygous SNPs, respectively. For a mismatch (mi) and a SNP (sj), the MI (I) was calculated as:

$$I\left(m_i,s_j\right)=\sum_{n_i \in N_1}\sum_{n_j \in N_2}p\left(n_i,n_j\right)\log(\frac{p\left(n_i,n_j\right)}{p\left(n_i\right)p\left(n_j\right)})$$

where N1 = {the two most frequent nucleotides for site 1}, N2 = {the two most frequent nucleotides for site 2}, and ni and nj denote the nucleotides of the sites in the pair. p(ni, nj) represents the probability of observing the ni and nj nucleotides in the same read, calculated using the maximum likelihood method. Natural logarithm was used for the calculation. Only the two most frequent alleles at each site were used. The MI of a pair of SNPs was calculated similarly.

For every mismatch or SNP site, there might exist multiple other SNPs harbored within the same reads. Thus, the overall MI for the mismatch or SNP was calculated as the average of all the pairwise MI values. Empirical p values were calculated for all the mismatches based on the distribution of the SNP MI. Mismatches with p value smaller than a threshold (0.05 by default) and not included in the dbSNP database were selected as predicted RNA editing sites, which were used for GLM training described below.

Scoring of mismatches via a GLM

GLM was used as the scoring model of L-GIREMI. Features included in the model are the allelic ratio of the mismatch, the mismatch type, and the sequences of the nearest neighbor nucleotides before and after the mismatch site. RNA editing sites identified in the above MI calculation step were used as positive training data. dbSNPs with p values larger than the threshold (0.05 by default) in the MI calculation step were used as negative training data. The training and testing of the GLM was carried out similarly as in GIREMI [36]. The score of each mismatch was calculated using the GLM. A score cutoff was chosen to maximize the F1 value. By default, predicted editing sites were defined as those that passed the score cutoff. This cutoff is dataset-specific. Alternatively, the user can define a customized score cutoff to achieve a desired level of %A-to-G among the predicted editing sites. As noted in the Results, for datasets with acceptable quality, the %A-to-G increased nearly monotonically relative to the GLM scores. Note that the MI-predicted editing sites were also scored by the GLM and only those that passed the score cutoff were retained.

Calculation of the Gini index of Alu editing

The Gini index was calculated for each Alu using the editing ratio of each read. For each Alu, all possible editing sites were identified using all the reads that covered the Alu. Then, for each read, the fraction of possible editing sites that were edited in this read was calculated, referred to as the editing ratio of the read. The editing ratios of all reads for each Alu were then used to calculate the Gini index (G). In order to speed up the calculation, the Gini index was calculated as half of the mean absolute difference normalized by the mean of editing ratios [62]:

$$G=\frac{{\sum }_{i=1}^{n}{\sum }_{j=1}^{n}\left|{x}_{i}-{x}_{j}\right|}{2{\sum }_{i=1}^{n}{\sum }_{j=1}^{n}{x}_{j}}=\frac{{\sum }_{i=1}^{n}{\sum }_{j=1}^{n}\left|{x}_{i}-{x}_{j}\right|}{2{n}^{2}\overline{x}}$$

where xi and xj are the relative editing ratios of the read i and j, respectively, and \(\overline{x}\) is the mean of all the editing ratios. n is the total number of reads for the Alu. This calculation was also carried out for shuffled data (based on randomization of As and Gs at possible editing sites across reads).

Identification of RNA editing sites in short-read RNA-seq data

We obtained polyA-selected cytosolic short-read RNA-seq data for GM12878 from the ENCODE project (ID: ENCSR000COR (ENCLB555ANM)). The RNA-seq reads were aligned to the human reference genome (hg38) with HISAT2 [63]. We then identified RNA editing sites using the “genome-aware” strategy, where pre-filtering and a SNP filter were applied [36]. Downsampling of the original mapped BAM file was carried out to achieve a similar sequencing depth as the long-read data to enable a fair comparison (with read length accounted for; about 30 million reads retained for the short-read RNA-seq).

Double-stranded RNA prediction

We applied a previously developed method to identify long dsRNAs harboring editing-enriched regions (EERs) [54]. This method is based on the rationale that the observation of A-to-I editing in an endogenous RNA is proof that the RNA is double stranded in vivo because ADARs edit dsRNAs. Briefly, we identified EERs using known RNA editing sites [64]. Overlapping 50 bp windows (each with ≥ 3 editing sites) were combined and EERs within 1 kb were classified as one EER. This distance allows formation of structures by distant binding partners. RNAfold was used to fold the EERs. Filters on minimum free energy and mismatch patterns were implemented to retain dsRNAs with > 200 bp stem length.

Experimental validation of allele-specific editing via Sanger sequencing

Genomic DNA (gDNA) and total RNA were extracted from GM12878 cells using the Quick-DNA™ Miniprep kit (Zymo Research) and the Direct-zol™ RNA Miniprep Plus kit (Zymo Research), respectively, following the manufacturer’s protocols. Two micrograms of the total RNA was used to generate cDNA using the SuperScript™ IV First-Strand Synthesis System (Invitrogen). Sequences + / − 100 bp flanking the mismatch site were amplified using the DreamTaq PCR Master Mix (2X) (Thermo Scientific). The primers used for each amplicon are provided in Additional file 2: Table S3. The PCR amplicons were resolved in 1% agarose gel and the bands of desired sizes were cut out and purified using the Zymoclean™ Gel DNA Recovery Kit (Zymo Research). Following purification, the amplicons were sent for Sanger sequencing (GENEWIZ from Azenta) using one of the PCR primers. Sites with alternative alleles in both gDNA and cDNA were validated as SNPs. Those with both A and G in the cDNA but only A in the gDNA were validated as RNA editing sites.