Background

Transcriptome profiling is widely used in cancer research and clinical settings, including drug discovery, diagnostic testing, and molecular biomarker discovery [1,2,3]. Formalin-fixed, paraffin-embedded (FFPE) tissue samples are the most commonly available clinical specimens with associated histopathology data, making them a key resource for developing new molecular biomarkers in clinical research [4, 5].

High-quality RNA from fresh biological tissues is optimal for generating reliable transcriptome data. Because RNA from FFPE samples is highly modified and fragmented over a wide range of sizes, standard mRNA-Seq (poly-A selection) methods for transcriptome analysis are challenging [6, 7]; total RNA-Seq (with rRNA depletion) or RNA exome capture are the preferred methods [8,9,10]. However, total RNA-Seq of FFPE RNA is generally inconsistent, likely due to variation in RNA quality, and yields an abundance of intronic, intergenic, and rRNA reads and fewer exonic reads [7, 11]. Consequently, fewer libraries can be multiplexed per sequencing lane than in standard mRNA-Seq if each is to yield sufficient reads, leading to higher sequencing costs [7, 12]. While RNA exome capture generates more exonic reads than total RNA-Seq, the capture procedure increases library preparation costs. Recently developed 3′ mRNA-Seq methods such as Tag-Seq [13], QuantSeq [14,15,16], and MACE RNA-Seq [17, 18] are now available. All three methods have similar procedures; however, QuantSeq has the most streamlined protocol, and all the reagents for library preparation are included in the kit. MACE RNA-Seq requires poly-A isolation before first-strand cDNA synthesis, while Tag-Seq is not available as a kit. The 3′ mRNA-Seq approach does not require RNA fragmentation before reverse transcription and only detects the 3′ end of the mRNA; thus, it may be used for degraded RNA samples, such as FFPE-derived RNA, with a faster turnaround time and lower costs for library preparation and sequencing [19, 20]. 3′ mRNA-Seq has been shown to yield data comparable with standard mRNA-Seq in high-quality RNA and to be a reliable method for gene expression profiling in FFPE samples [15, 16, 18, 20]; however, its performance in severely degraded FFPE samples has not yet been reported.

This study evaluates 3′ mRNA-Seq using the Lexogen QuantSeq 3′ mRNA-Seq Library Prep FWD Kit with unique molecular identifiers (UMI). The data are compared with TruSeq Stranded mRNA-Seq using Universal Human Reference RNA (UHR). RNA derived from fresh frozen (FF) and FFPE tissues, with varying input amounts and fragment size ranges, was also sequenced, and the results were compared with RNA Exome Capture. Our results show that severely degraded FFPE RNA can be sequenced by 3′ mRNA-Seq with UMI to yield accurate transcriptome profiles.

Results

Figure 1 shows the design of this study. First, we evaluated the performance of QuantSeq 3′ mRNA-Seq with UMI using a control RNA (UHR) and compared it with TruSeq Stranded mRNA-Seq. Next, we used FF and FFPE RNA samples, including severely degraded FFPE RNA. For the latter, we included four replicates to evaluate reproducibility. These data were compared with Exome Capture, which is optimized for FFPE-derived RNA. Samples used in this study had DV200 values ranging from 13% to > 70%, with input RNA between 1 ng and 100 ng; data for all samples in the study are included in Supplemental Data S1.

Fig. 1

The overall experimental design

QuantSeq 3′ mRNA-Seq performance using UHR and standard input and low input/FFPE protocols

The QuantSeq 3′ mRNA-Seq kit has two protocols: standard input for high-quality RNA (> 10 ng) and low input/FFPE for degraded or small amounts of RNA (≤10 ng). We evaluated reproducibility with these two protocols using UHR. Total mapped reads were similar among the different input amounts and protocols (87–99% of total reads). However, the unique reads after PCR bias correction gradually dropped as total input RNA decreased (from 56% to 10%, Fig. 2A). The total number of detected genes was ~ 15,000 to 22,000 (Fig. 2B), with the low input/FFPE protocol detecting fewer of the lowly expressed genes (Fig. 2C). Overall, samples correlated well within each protocol (standard input, R > 0.98; low input/FFPE, R > 0.94) and between protocols (R = 0.97, Fig. 2D).

Fig. 2

The PCR bias-corrected QuantSeq 3′ mRNA-Seq data in the UHR. A; Percentage of mapped reads out of total reads between standard input and low input/FFPE protocols by different input amounts. B & C; Total number of detected genes in the different inputs between protocols. D; Similarity matrix between the input amounts and protocols. Data were normalized by log2 (TPM + 1)

Comparison between QuantSeq 3′ mRNA-Seq and TruSeq stranded mRNA-Seq on UHR

QuantSeq 3′ mRNA-Seq data generated with the standard input protocol were compared with data from the Illumina TruSeq Stranded mRNA-Seq kit at the 100 ng input level, which is the minimum RNA input amount recommended by Illumina. The correlation between the two methods was moderate (R = 0.78, Fig. 3A). Standard mRNA-Seq mapped more reads to exonic regions (QuantSeq vs. TruSeq: 65% vs. 83%) but fewer to intronic regions (21% vs. 2%), intergenic regions (14% vs. 0.2%), and rRNA (8% vs. 2%, Fig. 3B). Both methods detected a similar number of expressed genes (22,304 and 21,319), of which 17,003 were shared (Fig. 3C). Protein-coding genes accounted for 71% of the genes detected by QuantSeq 3′ mRNA-Seq and 77% of those detected by TruSeq Stranded mRNA-Seq (Fig. 3D).

Fig. 3

Data comparison between the QuantSeq 3′ mRNA-Seq and TruSeq Stranded mRNA-Seq kit. A; Correlation plot. Data were normalized by log2 (TPM + 1). Each dot constitutes a gene. B; Distribution of mapped reads. The incompatible paired-end reads (15%) were not reflected in the TruSeq Stranded mRNA-Seq data. C; Number of detected genes between two platforms. D; Percentage of mapped reads distribution by RNA biotypes

Performance of QuantSeq 3′ mRNA-Seq in moderately degraded RNA

Next, we evaluated QuantSeq 3′ mRNA-Seq using degraded RNA derived from FFPE and FF samples with DV200 values > 30% (38–70%) at 10 ng input. Total mapped reads were 83 to 97% but dropped to 13 to 28% after PCR bias correction (Fig. 4A). The total number of detected genes was 11,603 to 17,818 (Fig. 4B). The samples included one paired set of FF (6) and FFPE (8B) samples, whose correlation was R = 0.73 and R = 0.92 at the 1 ng and 10 ng input levels, respectively (Fig. 4C & D).

Fig. 4

The PCR bias-corrected QuantSeq 3′ mRNA-Seq data in the degraded RNA (DV200 > 30%). A; Percentage of mapped reads out of total reads. Blue, non-PCR bias-corrected reads; Orange, PCR bias-corrected reads. B; Total number of detected genes. C; Similarity matrix in the paired FF and FFPE samples at 1 ng and 10 ng input. D; Correlation plot at 10 ng input between FF and FFPE samples. Samples 6-FF-70 and 8B-FFPE-70 are paired samples. Data were normalized by log2 (TPM + 1). 6-FF-70, 70% of DV200; 8B-FFPE-70, 70% of DV200; 2-FFPE-50, 50% of DV200; 3-FFPE-40, 40% of DV200; 2-FF-68, 68% of DV200. Each dot constitutes a gene

Application of QuantSeq 3′ mRNA-Seq to severely degraded RNA

To validate the performance of QuantSeq 3′ mRNA-Seq using highly degraded FFPE RNA with DV200 values ≤30% (13–30%), input amounts were increased up to 100 ng to achieve sufficient unique reads after PCR bias correction. The unique reads at 10 ng input ranged from 10 to 17% and increased to ~ 40–50% when the input amount was raised to 100 ng (Fig. 5A). Along with the increase in unique reads, the total number of detected genes increased from 10,316 to 16,999 (Fig. 5B). Overall correlations in the FFPE samples with 30% DV200 were relatively high at the 100 ng input (EF1-FFPE-30, R = 0.92 & GT1-FFPE-30, R = 0.88), and moderate at the 10 ng input (EF1-FFPE-30, R = 0.83 & GT1-FFPE-30, R = 0.82, Fig. 5C & D). Similarly, FFPE RNA with 13 and 20% DV200 showed good correlation between samples at the 100 ng input level (EF1-FFPE, R = 0.92, GT1-FFPE, R = 0.87 & 0.90), and moderate correlation at the 10 ng input (EF1-FFPE, R = 0.80 & 0.84, GT1-FFPE, R = 0.77 & 0.79).

Fig. 5

The QuantSeq 3′ mRNA-Seq data comparison using highly degraded RNA (DV200 ≤ 30%). A; Percentage of mapped reads out of total reads by different input amounts and average fragment size of RNA. Blue, non-PCR bias-corrected reads; Orange, PCR bias-corrected reads. B; Total number of detected genes in the different inputs and average fragment size of FFPE RNA. C & D; Similarity matrix at the 10 ng and 100 ng input amounts of EF1-FFPE-30 and GT1-FFPE-30. GT1-FFPE-13, 13% of DV200; GT1-FFPE-30, 30% of DV200; EF1-FFPE-20, 20% of DV200; EF1-FFPE-30, 30% of DV200; JB1-FFPE-19, 19% of DV200; 1-FF-20, 20% of DV200

Data comparison between QuantSeq 3′ mRNA-Seq and RNA exome capture kit in the severely degraded FFPE samples

The RNA exome capture method is designed for use with FFPE samples, as standard mRNA-Seq yields variable results; thus, we compared RNA Exome Capture data with QuantSeq 3′ mRNA-Seq data. Moderate correlation was observed, with R = 0.68 (EF1-FFPE-30) and R = 0.67 (GT1-FFPE-30, Fig. 6A). The average exonic reads were 38% in QuantSeq 3′ mRNA-Seq and 81% in the RNA Exome Capture kit, while intronic reads (44% vs. 3%), intergenic reads (19% vs. 5%), and rRNA reads (4% vs. 0.3%) were higher in QuantSeq 3′ mRNA-Seq than in the RNA Exome Capture kit (Fig. 6B). RNA Exome Capture detected 14,897 (EF1-FFPE-30) and 15,300 (GT1-FFPE-30) genes, and QuantSeq 3′ mRNA-Seq detected 13,075 (EF1-FFPE-30) and 12,498 (GT1-FFPE-30) genes; 12,589 (EF1-FFPE-30) and 12,119 (GT1-FFPE-30) genes were shared between the two methods (Fig. 6C).
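
The overlap between the two methods can be expressed as a percentage of each method's detected genes. The short Python sketch below does this arithmetic using only the gene counts reported above; the dictionary layout and the function name `overlap_percentages` are illustrative choices, not part of the original analysis.

```python
# Gene counts as reported in the text for the two severely degraded FFPE samples.
detected = {
    "EF1-FFPE-30": {"quantseq": 13075, "exome": 14897, "shared": 12589},
    "GT1-FFPE-30": {"quantseq": 12498, "exome": 15300, "shared": 12119},
}


def overlap_percentages(counts):
    """Return (% of QuantSeq genes shared, % of Exome Capture genes shared)."""
    pct_of_quantseq = round(100 * counts["shared"] / counts["quantseq"], 1)
    pct_of_exome = round(100 * counts["shared"] / counts["exome"], 1)
    return pct_of_quantseq, pct_of_exome


for sample, counts in detected.items():
    print(sample, overlap_percentages(counts))
# EF1-FFPE-30 (96.3, 84.5)
# GT1-FFPE-30 (97.0, 79.2)
```

Under this calculation, roughly 96 to 97% of the genes detected by QuantSeq 3′ mRNA-Seq were also detected by RNA Exome Capture, whereas about 79 to 85% of the Exome Capture genes were also detected by QuantSeq.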

Fig. 6

Data comparison between the QuantSeq 3′ mRNA-Seq and RNA Exome Capture. A; Correlation analysis. Data were normalized by log2 (TPM + 1). Each dot constitutes a gene. B; Distribution of mapped reads. Data are means of EF1-FFPE-30 and GT1-FFPE-30 samples from each kit ± SD. ***, p < 0.001; **, p < 0.01. The incompatible paired-end reads (11%) were not reflected in the RNA Exome Capture data. C; Number of detected protein-coding genes between two platforms. EF1-FFPE-30, 30% of DV200; GT1-FFPE-30, 30% of DV200

Discussion

Most mRNA-Seq studies use high-quality RNA from unfixed tissues or cells, and the standard mRNA-Seq method is widely employed to investigate underlying biological differences. However, standard mRNA-Seq is limited when RNA is degraded, showing 3′ bias in the data and poor library preparation performance. Several studies have suggested that a 3′ mRNA-Seq method may be a better option for such samples, as RNA degradation generally starts at the 5′ end [5, 16, 18]. In this study, we evaluated the performance of QuantSeq 3′ mRNA-Seq using UMI for PCR bias correction to obtain accurate gene expression data. Herein, we show 3′ mRNA-Seq using UMI to be an alternative option for gene expression studies over a wide range of RNA quality in FFPE-derived samples.

To validate the performance of QuantSeq 3′ mRNA-Seq with UMI, we first used UHR at different RNA input amounts. Two protocols are available for QuantSeq 3′ mRNA-Seq: a standard input protocol for higher-quality RNA and a low input/FFPE protocol for FFPE-derived or small amounts of RNA. Data were highly reproducible between the two protocols. As expected, the unique mapped reads after PCR amplification error correction gradually decreased as the RNA input amount decreased. As each transcript molecule is barcoded with a UMI before PCR amplification, the final data avoid PCR bias; thus, more accurate transcript counts are achievable even with 1 ng input amounts. However, TruSeq mRNA-Seq had better data quality, with a higher proportion of exonic reads and fewer intronic/intergenic and rRNA reads out of total reads than QuantSeq 3′ mRNA-Seq. This difference may be related to the enrichment of alternative poly-A sites in the 3′ mRNA-Seq method [12, 21]. It may also be affected by internal priming of oligo-dT primers on homopolymeric regions of transcripts, which generates erroneous reads during first-strand cDNA generation [12]. Lastly, greater read depth in TruSeq mRNA-Seq may increase exonic reads, while many 3′ mRNA-Seq reads correspond to poly-A sequences, and trimming these may also remove shorter reads and thus reduce relevant information [12]. In terms of data agreement, we observed a moderate correlation (R = 0.78), comparable to that reported by others using conventional mRNA-Seq and 3′ mRNA-Seq with UMI [22] or the KAPA Stranded mRNA-Seq kit and the Lexogen QuantSeq 3′ mRNA-Seq kit without UMI [16]. This may reflect data differences related to count bias toward longer transcripts in standard mRNA-Seq and amplification error correction in 3′ mRNA-Seq [18, 22]. The standard mRNA-Seq method requires a fragmentation step before reverse transcription with random hexamers to make cDNA, leading to more read counts per transcript, particularly from longer transcripts [16, 19, 23]. By contrast, 3′ mRNA-Seq generates one read per transcript without fragmentation before reverse transcription, and PCR amplification error correction is reflected in the analysis [18].
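
To illustrate why UMI barcoding removes PCR bias, the following minimal Python sketch collapses reads that share the same gene, mapping position, and UMI into a single molecule. It is a simplified illustration under stated assumptions, not the Partek Flow procedure used in this study: the function names and example reads are hypothetical, and only exact UMI matches are collapsed, whereas practical tools typically also tolerate sequencing errors within the UMI.

```python
from collections import defaultdict


def deduplicate_by_umi(reads):
    """Count unique molecules per gene.

    reads: iterable of (gene, position, umi) tuples.
    Reads with an identical (gene, position, umi) key are treated as
    PCR copies of one original molecule and are counted once.
    """
    seen = set()
    counts = defaultdict(int)
    for gene, position, umi in reads:
        key = (gene, position, umi)
        if key not in seen:
            seen.add(key)
            counts[gene] += 1
    return dict(counts)


def unique_fraction(reads):
    """Fraction of reads that remain after exact-match UMI deduplication."""
    reads = list(reads)
    return len(set(reads)) / len(reads)


# Hypothetical example: six mapped reads derived from three original molecules.
example_reads = [
    ("GAPDH", 100, "ACGT"),
    ("GAPDH", 100, "ACGT"),  # PCR duplicate of the read above
    ("GAPDH", 100, "TTGA"),  # same position, different UMI: distinct molecule
    ("ACTB", 50, "ACGT"),
    ("ACTB", 50, "ACGT"),    # PCR duplicate
    ("ACTB", 50, "ACGT"),    # PCR duplicate
]

print(deduplicate_by_umi(example_reads))  # {'GAPDH': 2, 'ACTB': 1}
print(unique_fraction(example_reads))     # 0.5
```

In this toy case, half of the reads are unique after correction. With less input RNA there are fewer original molecules and more PCR cycles, so a larger share of reads are duplicates, which is consistent with the decreasing unique read percentages observed at lower input amounts.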

The unique mapped reads and the total number of detected genes in the FFPE samples depended on RNA input, regardless of degradation level. In this study, even severely degraded FFPE RNA could be used for QuantSeq 3′ mRNA-Seq with at least 100 ng input, and data were highly correlated even in samples with DV200 values ≤30%. Previously, Turnbull et al. [20] reported more detected genes (25,610) from > 10-year-old FFPE samples using 500 ng input, suggesting that input amount may be a more important factor than degradation level for increasing unique reads in QuantSeq 3′ mRNA-Seq. We observed a high correlation between paired FF and FFPE samples (R = 0.92) at the 10 ng input level. Recently, Boneva et al. [18] reported high concordance between paired FF and FFPE samples (R² = 0.88) using the MACE-Seq with UMI method at the 1000 ng level. This supports the view that 3′ mRNA-Seq is a reliable method for gene expression studies in FFPE samples.

RNA exome capture detects more fusion genes and alternatively spliced genes compared to standard mRNA-Seq and total RNA-Seq in FFPE samples [8, 9, 12]. Also, previous reports showed that its gene expression quantification data are comparable with mRNA-Seq in high-quality RNA samples and with total RNA-Seq in degraded samples [11, 24]. However, the direct correlation between QuantSeq 3′ mRNA-Seq and the RNA Exome Capture kit was not robust in this study. As with the TruSeq Stranded mRNA-Seq data above, the differences may relate to count bias toward longer transcripts and higher numbers of sequencing reads in RNA Exome Capture, and to amplification error correction in QuantSeq 3′ mRNA-Seq. Although RNA Exome Capture data showed clear performance advantages over QuantSeq 3′ mRNA-Seq in the total number of genes captured, most of the protein-coding genes detected by QuantSeq 3′ mRNA-Seq overlapped with the RNA Exome Capture data. On the other hand, QuantSeq 3′ mRNA-Seq better quantifies gene expression. As Exome Capture targets the coding region only, it generates more information to quantify gene expression [11, 12, 24]. However, compared to QuantSeq 3′ mRNA-Seq, RNA Exome Capture has a longer protocol, and the library preparation includes amplification before and after capture, which may affect data quality, particularly for lowly expressed genes. Also, it captures only preselected RNAs and is only applicable to human samples [24]. In contrast, QuantSeq 3′ mRNA-Seq with UMI offers a fast turnaround time and, despite lower read depth, more accurate gene quantification; it also reveals alternative poly-A sites and allows more libraries to be multiplexed for sequencing [12, 16, 18]. Depending on project requirements, read depth may be increased by altering the multiplexing level.

Conclusions

This study evaluated QuantSeq 3′ mRNA-Seq with UMI by comparing it with TruSeq Stranded mRNA-Seq using high-quality RNA and with RNA Exome Capture using degraded RNA derived from FFPE tissue. We report that QuantSeq 3′ mRNA-Seq with PCR bias correction using UMI is a suitable method for gene quantification in both FF and FFPE RNAs. QuantSeq 3′ mRNA-Seq may be applied even to severely degraded RNA from FFPE tissues, generating high-quality sequencing data. QuantSeq 3′ mRNA-Seq using UMI is one cost-effective means of investigating gene expression; however, other approaches may yield more information, including a greater number of detected genes, alternative splicing, and fusion genes. Thus, investigators should select the most suitable method based on the goals of the experiment and the condition of the samples, because each platform has a different chemistry and sensitivity. Nevertheless, QuantSeq 3′ mRNA-Seq with UMI provides an opportunity, particularly for gene expression analyses in severely degraded specimens for which RNA-Seq may not have been feasible in the past.

Methods

RNA extraction from FF and FFPE samples

FFPE samples were cut into 10 μm sections, and several sections were placed into a 1.5 ml tube. Xylene was added for deparaffinization, then total RNA was extracted with the Qiagen miRNeasy FFPE kit (Qiagen, CA, USA) following the manufacturer's protocol. Total RNA from fresh frozen (FF) Sample 6 was extracted using TRIzol (Thermo Fisher Scientific, MA, USA) following the manufacturer's protocol. UHR was purchased from Thermo Fisher Scientific. Total RNA was quantified by Qubit, and its quality was assessed with the Agilent 2100 BioAnalyzer (Agilent Technologies, CA, USA). The DV200 value (the percentage of RNA fragments > 200 nucleotides) was determined with the 2100 Expert software.
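
As an illustration of the DV200 definition given above, the following Python sketch computes the percentage of signal from fragments longer than 200 nucleotides in a size-binned trace. This is a simplified approximation and not the 2100 Expert software's implementation: the function name `dv200`, the binned input format, and the example trace are all hypothetical.

```python
def dv200(sizes_nt, intensities, threshold_nt=200):
    """Percentage of total signal from fragments longer than threshold_nt.

    sizes_nt: fragment sizes (nucleotides) for each bin of the trace.
    intensities: signal measured in each bin (arbitrary units).
    """
    if len(sizes_nt) != len(intensities):
        raise ValueError("sizes_nt and intensities must have the same length")
    total = sum(intensities)
    if total <= 0:
        raise ValueError("trace contains no signal")
    above = sum(sig for size, sig in zip(sizes_nt, intensities) if size > threshold_nt)
    return 100.0 * above / total


# Hypothetical trace of a heavily degraded sample (values are made up).
sizes = [50, 100, 150, 200, 300, 500, 1000]
signal = [30, 25, 15, 10, 10, 6, 4]

print(dv200(sizes, signal))  # 20.0
```

A sample like this hypothetical one, with a DV200 of 20%, would fall in the severely degraded group (DV200 ≤ 30%) as defined in this study.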

Library generation

There are two protocols for library preparation with the QuantSeq 3′ mRNA-Seq Library Prep Kit-FWD (Lexogen, Vienna, Austria). For the standard input protocol, UHR was incubated for 15 min at 42 °C to generate first-strand cDNA, and RNA was removed. The UMI second-strand synthesis mix was added to generate second-strand cDNA, followed by purification of the double-stranded cDNA and library amplification by PCR with dual indices for 11 cycles. UHR at 10 ng and 1 ng, and all FFPE and FF samples, were processed using the low input/FFPE protocol. The low input/FFPE protocol follows most of the same steps as the standard input protocol, but the first-strand cDNA incubation was increased to one hour and the library amplification PCR was increased to 22 cycles.
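
The differences between the two protocols, as applied in this study, can be summarized in a small Python sketch. The parameter values (incubation time, PCR cycles, and the 10 ng threshold) are taken from the text; the `Protocol` class and the `choose_protocol` helper are hypothetical names introduced only for illustration and are not part of the Lexogen kit documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    name: str
    first_strand_minutes: int  # first-strand cDNA incubation at 42 °C
    pcr_cycles: int            # library amplification cycles used in this study


STANDARD = Protocol("standard input", first_strand_minutes=15, pcr_cycles=11)
LOW_INPUT_FFPE = Protocol("low input/FFPE", first_strand_minutes=60, pcr_cycles=22)


def choose_protocol(input_ng, degraded):
    """Pick a protocol following the rule described in the text.

    Degraded RNA or small inputs (<= 10 ng) use the low input/FFPE protocol;
    high-quality RNA above 10 ng uses the standard input protocol.
    """
    if degraded or input_ng <= 10:
        return LOW_INPUT_FFPE
    return STANDARD


print(choose_protocol(100, degraded=False).name)  # standard input
print(choose_protocol(10, degraded=False).name)   # low input/FFPE
print(choose_protocol(100, degraded=True).name)   # low input/FFPE
```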

For the standard mRNA-Seq library, the TruSeq Stranded mRNA-Seq library kit (Illumina, CA, USA) was used following the manufacturer's protocol. Briefly, mRNA from 100 ng of UHR was isolated using mRNA isolation beads and fragmented for 4 min at 94 °C. The first-strand cDNA was synthesized at 42 °C, and the second-strand cDNA was synthesized at 16 °C for one hour with a second-strand marking buffer. Double-stranded cDNA was cleaned using DNA XP beads (Beckman Coulter, IN, USA), then A-tailed, ligated with index adapters, and amplified for 15 cycles, and the final library was cleaned using DNA XP beads.

For the RNA exome capture library, the TruSeq RNA Exome Capture kit (Illumina, CA, USA) was used following the manufacturer's protocol. Briefly, 500 ng of highly degraded RNA was used for first-strand cDNA synthesis at 42 °C. The second-strand cDNA was synthesized at 16 °C for one hour with a second-strand marking buffer. Double-stranded cDNA was cleaned up with DNA XP beads, A-tailed, ligated with index adapters, and amplified for 15 cycles, and the library was then cleaned with DNA XP beads. The cDNA library was quantified using Qubit and an Agilent 2100 BioAnalyzer D1000 chip, and 200 ng of each library was pooled for exome enrichment and capture. After the second enrichment, the pooled final libraries were amplified for 10 cycles, and the final library was cleaned using DNA XP beads.

The libraries were quantified with the BioAnalyzer 2100 system using the D1000 kit (Agilent, CA, USA) and with Qubit dsDNA BR Assay kits (Thermo Fisher Scientific, MA, USA). All libraries were sequenced as 101 bp paired-end reads on an Illumina HiSeq 4000 or MiSeq.

Data analysis

For the 3′ mRNA-Seq data, ~ 1.5 to 8 million (M) total reads were generated from each library. The Read 1 FASTQ files were uploaded into Partek Flow software (Partek Inc., MO, USA), and primary QC was performed. The UMI reads were identified, and adapter and poly A/T sequences were trimmed. The STAR (2.6.1d) [25] aligner was used to align reads to the human reference genome (hg38). After alignment, UMIs were deduplicated, and the final BAM files were quantified against Ensembl annotations (Ensembl Transcripts release 92) using the Partek E/M algorithm [26]. For the standard mRNA-Seq and the RNA exome capture data, ~ 30 to 43 M read pairs were generated from each library, and FASTQ files were uploaded into Partek Flow software. After primary QC was performed, the reads were aligned to the human reference genome (hg38) using the STAR (2.6.1d) aligner. The final BAM files were quantified against Ensembl annotations (Ensembl Transcripts release 92) using the Partek E/M algorithm. The quantified read counts were normalized to TPM (transcripts per million) values and transformed to log2 (TPM + 1) values. Pearson's R was used for sample correlation analysis on the PCR bias-corrected data. Protein-coding genes were used for the comparison between the 3′ mRNA-Seq and RNA exome capture methods. The two-tailed Student's t-test was used for statistical analyses.
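
The normalization and correlation steps described above can be sketched in a few lines of Python. This is a minimal illustration using the conventional TPM definition (length-normalized counts scaled to sum to one million), followed by the log2 (TPM + 1) transform and Pearson's R. It is not the Partek Flow implementation, and whether length normalization is applied in the same way to 3′ mRNA-Seq counts there is an assumption; the function names are hypothetical.

```python
import math


def tpm(counts, lengths_kb):
    """Convert read counts to TPM using transcript/gene lengths in kilobases."""
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    scale = sum(rates)
    if scale == 0:
        raise ValueError("all counts are zero")
    return [1e6 * r / scale for r in rates]


def log2_tpm_plus_one(values):
    """Apply the log2(TPM + 1) transform used for plotting and correlation."""
    return [math.log2(v + 1.0) for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts for four genes in two libraries (values are made up).
lengths_kb = [1.0, 2.0, 0.5, 4.0]
library_a = [100, 400, 10, 0]
library_b = [90, 500, 20, 8]

expr_a = log2_tpm_plus_one(tpm(library_a, lengths_kb))
expr_b = log2_tpm_plus_one(tpm(library_b, lengths_kb))
print(pearson_r(expr_a, expr_b))
```

Adding 1 before the log transform keeps genes with zero counts defined (they map to 0), which is why undetected genes still contribute to the correlation.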