Background

High-throughput sequencing of long (> 100 nt) or small (18–50 nt) RNA provides a quantitative measure of RNA abundance. However, RNA-seq and small RNA-seq library construction can introduce bias at multiple steps, such as fragmentation of long RNAs, reverse transcription, adapter ligation, library amplification by PCR, and sequencing. Commonly used high-throughput sequencing platforms, including those made by Illumina and Pacific Biosciences, require PCR amplification during library construction to increase the number of cDNA molecules to an amount sufficient for sequencing. However, PCR stochastically introduces errors that can propagate to later cycles [1, 2]. PCR also amplifies different molecules with unequal probabilities [1]. PCR duplicates are reads that are made from the same original cDNA molecule via PCR.

A common practice to eliminate PCR duplicates is to remove all but one read of identical sequences, assuming that such reads have been created from the same cDNA molecule by PCR [3]. This assumption may be flawed, especially with ever higher sequencing throughput, which increases the chance of observing reads with identical sequences but from different cDNA molecules. The situation is further exacerbated for small genomes and for techniques that interrogate a subspace of the genome. For example, the majority of small RNA-seq reads are microRNAs (miRNAs) or PIWI-interacting RNAs (piRNAs), which derive from loci that amount to just a few percent of the genome [4,5,6,7]. The assumption also has systematic biases. For example, a shorter gene is more likely to give rise to identical RNA-seq reads than a longer gene with the same transcript level, simply because the “genomic space” for the random process of RNA fragmentation is smaller for the shorter gene. Finally, the conventional definition of PCR duplicates is based on mapping coordinates—reads mapping to the exact same genomic location are considered to have identical sequences. However, many small RNAs with the same sequence can be produced from multiple genomic loci; thus, strategies using genome mapping to identify PCR duplicates ignore the situation that identical reads arise from distinct sites in the genome.

Standard library preparation and sequencing procedures typically have pre-specified PCR and sequencing error rates, but parameters such as the amount of starting RNA used to generate a library, the number of reads sequenced (i.e., sequencing depth), and the number of PCR cycles used are often adjusted to accommodate sample source, abundance, and quality. While the notion that more PCR amplification increases artefactual duplicate reads in high-throughput sequencing makes intuitive sense and is widely accepted, high PCR cycle numbers are often necessitated by scarce starting materials, another likely cause for duplicate reads. Thus, the contribution of PCR cycle number to PCR duplicates is often confounded with the contributions of starting materials and sequence depth.

Unique molecular identifiers (UMIs) are often used to accurately detect PCR duplicates and quantify transcript abundance [8,9,10,11,12,13,14,15,16]. If each molecule in the starting pool is barcoded with a UMI, i.e., all molecules are unique, then reads with the same UMI must be PCR duplicates. In practice, only the molecules in the starting pool that have identical sequences need to have different UMIs.

One strategy to incorporate UMIs introduces pre-defined, manually-selected sequences into the adapters. This strategy can avoid UMIs with suboptimal GC content and minimize complementarity between or within UMI sequences [13]. Because UMI identities are unambiguously defined, sequencing and PCR errors can be easily corrected. However, implementing pre-defined UMIs requires a large number of costly, custom-synthesized oligonucleotides.

An alternative strategy employs adapters that contain random nucleotides at certain positions in the adapters. The combinations of the random-nucleotide positions lead to an exponential number of different UMIs at almost no extra cost, because incorporating a random nucleotide costs the same as incorporating a specific nucleotide during DNA synthesis. UMIs bearing either five (45 = 1,024 unique barcodes) or ten random nucleotides (410 = 1,048,576 unique barcodes) were implemented cost-effectively and shown to improve PCR duplicate removal [8, 14]. A higher number of unique combinations can be achieved simply by increasing the number of random-nucleotide positions. The number of UMI combinations must be sufficiently large because as mentioned above, the chance that two cDNA molecules with identical sequences in the starting pool are tagged with the same UMI combination needs to be infinitesimally small.

Here, we describe novel experimental protocols and computational methods to unambiguously identify PCR duplicates in RNA-seq and small RNA-seq data. We show that removing PCR duplicates using UMI information is accurate, whereas removing PCR duplicates without UMIs is overly aggressive, eliminating many biologically meaningful reads. Finally, we show that the amount of starting materials and sequencing depth determine the level of PCR duplicates, without additional contribution from the extent of PCR amplification.

Results

Adapting standard RNA-seq procedures to incorporate UMIs

To incorporate UMIs into RNA-seq, we modified a published, strand-specific, library construction protocol [17]. The original method has proved to be robust and time-efficient, and the adapter ligation step uses DNA adapter oligonucleotides that can be readily synthesized at a low cost [7, 18,19,20]. The standard protocol uses a single Y-shaped DNA adapter comprising two partially complementary oligonucleotides and an unpaired 3′ thymidine that pairs with the single adenine tail added to both ends of the double-stranded cDNA fragments. We modified the adapters by inserting a five-nucleotide random UMI (Fig. 1a, b). Consequently, each cDNA fragment is ligated to an adapter with a UMI at each end, randomly choosing one out of 1,048,576 (45 × 45) possible combinations provided by two UMIs.

Fig. 1
figure 1

UMI incorporation into RNA-seq. a Overall workflow. Schematic of a read produced from RNA-seq with UMIs (b) and of UMI locators (c)

Our UMI RNA-seq adapters were designed so that the sequencing reaction begins at the very first nucleotide of the 5′ UMI (Fig. 1b). The random nucleotides of UMIs offer the sequence diversity in the initial five sequencing cycles. This sequence diversity is critical for commonly used Illumina sequencing platforms, such as HiSeq, MiSeq, and NextSeq, to generate base-calling templates and make accurate models for discriminating read clusters [21, 22]. To avoid insertions or deletions within or flanking a UMI, albeit rare, from altering the UMI identity, we further designed a “UMI locator”, a pre-defined trinucleotide 3′ to the UMI (e.g. 5′–NNNNNATC–3′). The trinucleotide serves as an anchor allowing unambiguous identification of each UMI (Fig. 1b). Taking the properties of our sequencing instrument of choice—NextSeq 500—into consideration, the 3 nt UMI locator sequence and the mandatory thymidine required for ligation that immediately follows (Fig. 1b) corresponded to the sequencing cycles 6–9, after the first five critical cycles required by the instrument for template generation [23]. However, NextSeq still deemed these four invariant positions of low complexity and reported low-quality sequencing data. Previous approaches to tackle this problem include increasing the diversity of the initial sequences in the library, mixing the library with a high diversity sample (spike-in), lowering sequencing cluster density, or any combination of the above [21]. We designed three UMI locator sequences (Fig. 1c), and, by pooling adapters with one of these sequences at equimolar amounts, we were able to resolve the low complexity problem. Using this approach, we generated RNA-seq libraries from mouse brain, heart, kidney, liver, lung, muscle, spleen, and testis total RNAs. The libraries were sequenced at a read depth, coverage, and quality comparable to libraries generated using the original protocol without UMIs (Additional file 1). Thus, our method of incorporating UMIs, as well as UMI locator, does not interfere with library preparation and sequencing. We subsequently observed that even two different UMI locator sequences sufficed to overcome the erroneous low-quality calling by NextSeq (small RNA-seq, Fig. 2).

Fig. 2
figure 2

UMI incorporation into small RNA-seq. a Overall workflow. The method uses a 3′ adapter composed of DNA, except for a single, 5′ ribonucleotide (rA); the 5′ adapter is entirely RNA. A standard index barcode allows multiplexing. b Schematic of a read produced from small RNA-seq with UMIs

Adapting standard small RNA-seq protocol to incorporate UMIs

Previously, we established a reliable and robust small RNA-seq protocol by modifying a published method which utilizes oligonucleotides compatible with Illumina sequencing platforms [24, 25]. Compared to UMI RNA-seq, incorporation of UMIs into this small RNA-seq protocol requires additional considerations. First, the number of distinct UMI combinations needs to be significantly greater than what is required for RNA-seq. For example, millions of piRNA species—an abundant class of small RNAs in the animal germ line—can be routinely detected in a single individual, and it is estimated that there can be as many as 1 million distinct piRNA molecules in a single spermatocyte or round spermatid [4, 6, 26, 27]. The most abundant piRNA species in this study has 42,281 reads. In the soma, the most abundant miRNA can take up > 40% of the total sequencing depth [28] —tens of millions of reads. Such enormous abundance requires a sufficiently high number of UMI combinations to capture all distinct sequences. Second, the length range of small RNAs (< 50 nt) plus a longer UMI is still well within the read length achievable by common sequencing instruments. Third, the length of a small RNA is a defining feature of its identity and thus, insertions or deletions could lead to misclassification of small RNAs. The second and third considerations also indicate that small RNA-seq is ideally suited for the testing of a large combination of UMIs.

We tested UMIs containing 10 consecutive random nucleotides. Although both the 3′ and 5′ adapters containing 10-nt UMIs ligated to small RNAs with nearly the same efficiency as the original adapters without UMIs, the resulting small RNA-seq libraries yielded unexpectedly short, variable-length reads that contained truncated insert and adapter sequences (data not shown). We speculate that long stretches of random nucleotides interfere with oligonucleotide annealing, a critical step in cDNA synthesis, PCR, and sequencing, by increasing the chance that a primer anneals to a UMI instead of its target sequences. Inter- and intramolecular annealing of 10 nt UMIs may also contribute to truncated reads.

To avoid a long stretch of random nucleotides, we used the UMI locator strategy described above to space out several short stretches of random nucleotides. For each adapter, we designed three trinucleotide UMI sequences, each separated from another by a trinucleotide UMI locator (e.g., 5′–NNN-CGA-NNN-TAC-NNN–3′; Fig. 2a, b). Two adapters with such UMIs can produce a trillion combinations, which should suffice all deep-sequencing applications. Similar to our RNA-seq strategy, we designed adapters with two different sets of UMI locator sequences—mixed at equimolar—to increase the sequence complexity in the early sequencing cycles. This strategy allowed us to successfully generate and sequence the UMI small RNA-seq libraries, unambiguously locate UMIs, and computationally remove reads containing insertions or deletions in UMIs due to reverse transcription, PCR, and sequencing errors (Fig. 2b). We tested our method using total RNAs extracted from mouse testes isolated 17.5 days after birth. To assess the impact of the amount of starting materials on PCR duplicates, we prepared small RNA-seq libraries using a range of 39–5,000 ng RNAs made from serial dilution. To test the effect of PCR cycles, we gradually increased the PCR cycles for each library with a two-cycle increment. The resulting UMI small RNA-seq libraries yielded high-quality sequencing data, comparable to those generated with the original non-UMI protocol (Additional file 1).

Diverse UMIs capture all read species in RNA-seq and small RNA-seq

As mentioned above, to accurately identify PCR duplicates using UMIs, it is critical that the number of distinct UMIs far exceeds the maximal number of starting molecules with identical sequences, such that these molecules have an infinitesimal probability of being ligated to adapters with the same UMI. Previous UMI methods were designed for sequencing single cells or an organism with a less complex transcriptome than mammals [10, 13]. In particular, testis has a higher-complexity transcriptome than many other tissues such as muscle, liver, and even brain [29], demanding a large number of UMI combinations. Our UMI RNA-seq protocol theoretically provides ~ 1 million (410) distinct combinations, and we asked whether this diversity far exceeded the maximal number of reads with identical sequences in our libraries. Indeed, the transcripts derived from the 299-bp 7S RNA 1 gene produce 19,271 identical reads mapping to the same genomic coordinate, all of which are attached to distinct UMI sequences, indicating that all of these reads were from different starting RNA molecules. In conclusion, our UMI RNA-seq protocol is more than sufficient to disambiguate biologically identical reads from PCR duplicates.

Our UMI small RNA-seq provides an even higher number of possible combinations with 18 nt UMIs—68.7 billion (418)—much larger than the number of reads currently produced by a sequencing run. The most abundant small RNA species in our datasets is a piRNA with 42,281 reads, far fewer than the number of UMI combinations our protocol provides. We conclude that the UMI lengths used in the RNA-seq and small RNA-seq protocols contain a sufficient UMI diversity for current and, most likely, future sequencing experiments.

Error-correction for UMIs only slightly improves PCR duplicate identification

To test whether UMIs could accurately identify PCR duplicates, we first evaluated their performance using simulated data. Assuming a library has sufficiently diverse UMI sequences, the simplest way to determine biologically identical reads is to look for reads with the same sequence but are tagged by different UMIs. This approach assumes that there is no error in the replication or reading of the UMI sequences, since such errors could render identical UMI sequences different and vice versa, causing misidentification of PCR duplicates. UMI errors could occur during PCR sequencing, and computationally correcting these errors has been shown to improve identification of PCR duplicates [14, 16, 30,31,32].

We designed a strategy for correcting UMI errors with the following considerations in mind. First, UMI errors are rare, with rates stipulated by the chemistry of PCR and sequencing (~ 10− 5 and ~ 10− 3 errors per position respectively) [33,34,35,36]. Second, when two sufficiently long UMIs (for example, 10 and 18 nt in this study) that differ by just one base are connected to two reads with identical sequences, the probability that these are PCR duplicates of the same UMI with an error, albeit low (p < 10− 3) is still much higher than the probability that these are two distinct UMIs (p = 4− 10 for RNA-seq and 4− 18 for small RNA-seq in this study). Adopting an error-correction method previously developed for RNA-seq [16], we built a UMI graph for each group of reads (Fig. 3a). For RNA-seq, the reads that map to the same genomic position form a group. This approach does not work for small RNAs, because they often originate from multiple genomic loci. Thus, we simply defined a group of small RNA reads as those with identical sequences. In both the RNA-seq and small RNA-seq UMI graphs, a node denotes a unique UMI and further holds the number of reads with that UMI (Fig. 3a). For each pair of UMIs (say, UMI a and UMI b) that differ by just one base (one edit distance apart), we connect their nodes if na ≥ 2 × nb − 1, where na and nb represent read counts for the two UMIs. We require a twofold difference between na and nb, because as described above, the error rates for PCR and sequencing are low, and the twofold differences corresponds to the most extreme case whereby an error occurred during the first PCR cycle. However, a twofold difference is too stringent for pairs of UMIs with low read counts (e.g., 1 versus 2), for which the error predominantly arose from sequencing. We therefore added “− 1” to ensure that these UMIs could be connected. All connected UMIs are then assumed to originate from the most abundant UMIs in the graph. This scheme allows correction of two or more errors in UMIs, provided that the intermediate UMIs are observed (for example, the intermediate UMI with one error and UMI with two errors in Fig. 3ab). One could relax the stringency of this method by adding direct connections between two nodes that differ in two or more positions.

Fig. 3
figure 3

Identifying PCR duplicates. a Strategy for correcting errors in UMIs. b Illustration of how correcting errors in UMIs increases accuracy of PCR duplicate elimination

The need for error-correction might depend on the experimental conditions, including the PCR amplification probability, PCR and sequencing error rates, UMI length, number of initial molecules, number of sequenced molecules, and number of PCR cycles. We performed computer simulations to investigate the effects of these seven experimental conditions on UMI error correction by systematically varying one variable at a time while holding the other six constant. Each round of simulation produced a known number of PCR duplicates and therefore, unlike experimental data, the true fraction of all reads corresponding to PCR duplicates can be determined in the simulated data. To assess the accuracy of PCR duplicate identification using UMIs, we calculated the difference between the number of reads after PCR duplicate removal (“estimate”) and the true value (“truth”) relative to the true value: (estimate − truth)/truth. This metric reflects the extent to which UMIs over- or underestimate the truth as a fraction of the true value. We started the simulation with 100 initial molecules. We then performed PCR by randomly assigning a probability to each molecule (tagged with an 18 nt UMI) to be duplicated in each PCR cycle. The probability follows a uniform distribution between m and 1, where m denotes minimum amplification probability (it can be any value between 0 and 1 and is set to 0.8 in the baseline condition). Minimum amplification probability can be interpreted as PCR efficiency, because the efficiency (average probability) that a molecule is doubled during each PCR cycle is (1-m)/2. Ten cycles of PCR (PCR error rate set to 3 × 10− 5) [33,34,35] generated a pool of 61,000 ± 1,000 (mean ± S.D.) molecules. To test the effect of sequencing depth, we randomly drew 100 molecules from the pool for sequencing (sequencing error rate set to 10− 3) [36] (Fig. 4; Additional file 2: Figure S1). We call this set of parameters “baseline condition”, and it forms the base line from which we systematically varied each parameter. For each condition, we performed 10,000 trials.

Fig. 4
figure 4

Simulation of PCR duplicate removal with or without error correction for UMIs. One parameter (PCR cycle number, starting material, or sequencing depth) was varied with the other parameters kept constant. Upper plots show the fraction of duplicates, while lower plots show the accuracy of duplicate detection. Each dotted line indicates the value for this parameter used in other simulations

We first assumed that there was no error in UMIs (Fig. 3a) and found that on average, (estimate – truth)/truth = 2.10% across 10,000 trials under the baseline condition. Thus, without performing UMI error correction, we slightly overestimated the total number of biological molecules as an error in a UMI would artificially create an extra UMI, and in turn, we slightly underestimated the fraction of PCR duplicates (red vs gray lines in Fig. 4; Additional file 2: Figure S1). Next, we used the UMI graph approach described above (Fig. 3a, b) for correcting errors in UMIs, and the new average of (estimate – truth)/truth = 0.0388%. Even though correcting UMI errors consistently gives better (estimate – truth)/truth than not correcting the errors, the absolute difference in the fractions of PCR duplicates between the two approaches is small (Fig. 4; Additional file 2: Figure S1). For example, under the baseline condition, the true fraction of duplicates was 37.8 ± 3.2%; without correcting UMI errors yielded 36.5 ± 3.3%, and correcting UMI errors gave 37.8 ± 3.2%.

Under some extreme conditions, correcting UMI errors yields substantially better results. For example, if we modify PCR error rate in the baseline condition from the default 3 × 10− 5 [33,34,35] to 10− 3, correcting UMI errors still yields a fraction of duplicates (37.2 ± 3.2%) very close to the truth (37.2 ± 3.1%), while not correcting the errors underestimates the fraction of duplicates (32.1 ± 3.5%). In conclusion, error-correction for UMIs consistently, albeit slightly, improves PCR duplicate identification. Therefore, we performed error correction for all following analyses.

Removing PCR duplicates without using UMIs is fundamentally flawed

Does the practice of removing PCR duplicates without UMIs improve the quantification of both long and short transcripts and in particular, of small RNAs such as microRNAs or piRNAs, which collectively originate from a small portion of the genome? We compared PCR duplicate identification using UMIs together with mapping coordinates of the reads to the conventional approach of using coordinates alone.

When only mapping coordinates were used (RNA-seq data from eight mouse tissues) (Additional file 1), 16.4–44.5% RNA-seq reads were determined to be PCR duplicates, whereas using UMI information in conjunction with coordinates identified only 1.89–10.67% as duplicates. That is, the majority of reads mapping to identical coordinates were in fact not PCR duplicates but rather from distinct starting molecules that should be counted for transcript abundance. The situation is even worse for small RNA-seq data, when only small RNA sequences were used, the majority (56.0–76.8%) of reads were flagged as PCR duplicates and therefore excluded from analysis. In contrast, when UMI information was used together with the sequences of reads, just 1.05–13.6% of reads were determined to be duplicates. Thus, most of the identical reads in RNA-seq and small RNA-seq are biologically real and not PCR duplicates, consistent with the view that small RNAs, which tend to come from precisely the same small genomic regions, can easily be mistaken for PCR duplicates when UMI information is not used. Moreover, the assumption that common mapping coordinates indicate PCR duplicates becomes increasingly problematic as sequencing depth increases, because the chance of observing two identical reads that legitimately derive from different molecules before PCR also increases.

We further tested whether PCR duplicate removal using only mapping coordinates is appropriate for transcript quantification (Fig. 5a). The conventional method underestimated the abundance of 119 transcripts by 1.25 fold or more: removing PCR duplicates based only on coordinates is too aggressive. These 119 transcripts are significantly shorter (median length = 602 nt) and more highly expressed (median abundance = 200 FPKM) than the other transcripts (median length = 1,620 nt; median abundance = 13.2 FPKM; Wilcoxon rank sum test p values = 2.22 × 10− 44 and 1.80 × 10− 59, respectively) (Fig. 5b). Thus, overestimation of PCR duplicates without UMIs reflects (1) a higher tendency of short transcripts to produce identical fragments due to more limited possibilities in fragmentation, and (2) a higher tendency of highly expressed genes to produce identical fragments. We conclude that removing PCR duplicates in RNA-seq data solely by mapping coordinates introduces substantial bias for short or highly expressed genes and that UMIs allow more accurate quantification of PCR duplicates and transcript abundance.

Fig. 5
figure 5

a Transcript abundance (FPKM) calculated by removing PCR duplicates using only mapping coordinates compared to using mapping coordinates and UMIs. b Using only mapping coordinates significantly biases against abundant and short genes. Outliers omitted. Wilcoxon rank sum test; n, number of genes in each group. c Relationship between cumulative coefficient of variation and transcript abundance

UMIs improve data reproducibility

One metric for evaluating the quality of experimental data is the reproducibility between technical replicates. We evaluated how UMIs affect the reproducibility of transcript quantification using five libraries generated using the same sample of total mouse testis RNA, but with gradually decreasing amounts of starting RNA and correspondingly increasing numbers of PCR cycles: 4 μg (8 PCR cycles), 2 μg (9 PCR cycles), 1 μg (10 PCR cycles), 500 ng (11 PCR cycles), 125 ng (13 PCR cycles) (Additional file 1). We then analyzed the data sets treating PCR duplicates using one of three approaches: (1) no PCR duplicates were removed; (2) PCR duplicates were removed using the conventional approach of identical genomic locations; and (3) PCR duplicates were removed using UMIs together with mapping coordinates. We compared the three approaches by calculating coefficients of variation (CV = S.D./mean) for transcript abundance across the five RNA-seq libraries. Compared to removing no duplicates, removing duplicates according to their mapping coordinates decreased the total CV by 5.80% (from 4,210 to 3,960), while using UMIs with mapping coordinates decreased the total CV by 6.67% (from 4,210 to 3,930) (Fig. 5c). For example, when two RNA-seq libraries (125 ng with 12 PCR cycles and 1 μg with 10 PCR cycles) were compared, the number of transcripts whose abundance differed by ≥25% decreased when duplicates were removed (1,880 without duplicate removal, 1,503 removing duplicates by genomic coordinates, and 1,415 removing duplicates using UMIs). We conclude that removing PCR duplicates in RNA-seq data using mapping coordinates alone improves the overall precision of transcript quantification for all genes, and removing PCR duplicates by combining coordinates and UMIs further improves the precision, albeit slightly.

Next, we evaluated the performance of these three approaches for a series of small RNA-seq libraries (starting material 39–5,000 ng). Compared to removing no duplicates, using UMIs to remove duplicates decreased the total CV by 8.72% (Fig. 5c). Surprisingly, removing duplicates according to their mapping coordinates alone increased CV by 79.1% (from 6,490 to 11,620) (Fig. 5c). For example, between two small RNA-seq libraries in this series, one generated from 150 ng and the other from 1 μg of the same total RNA sample, genomic loci (piRNA genes and GENCODE-annotated genes) whose small RNA abundance differed by ≥25% decreased 8.30% when duplicates were removed using UMIs (from 2,613 to 2,396 genes). In contrast, when duplicates were removed using solely mapping coordinates, the number of such irreproducible genes increased by 159% (6,762 genes). These results show that removing PCR duplicates with UMIs leads to more consistent quantification across libraries, whereas removing duplicates without UMIs is overly aggressive and decreases the reproducibility of small RNA-seq experiments.

PCR cycles alone do not determine the frequency of PCR duplicates

One might think that the number of PCR cycles used to amplify the initial cDNA is the major cause of PCR duplicates in sequencing libraries [37]. We sought to test this and to identify other experimental contributing factors. As described above, we performed computer simulations to test the impact of UMI error correction on PCR duplicate detection. We considered seven parameters that could impact the level of PCR duplicates during an RNA-seq or small RNA-seq experiment. Assuming that we have performed UMI error correction, we now examine in detail these seven parameters for their impact on the level of PCR duplicates.

Four of the parameters—PCR amplification efficiency, PCR error rate, sequencing error rate, and UMI length—are specified by the experimental reagents and sequencing platform and typically not adjusted from experiment to experiment. Our simulation results indicate that varying the sequencing error rate, the PCR error rate, or the UMI length around their default values in the baseline condition (i.e., within the ranges stipulated by experimental settings) did not have a significant effect on the fraction of PCR duplicates (the blue line is flat around the dashed vertical line in Additional file 2: Figure S1A–C, top panels). In comparison, PCR efficiency had a measurable effect (the blue line in the top panel of Additional file 2: Figure S1D reveals a negative correlation with PCR efficiency). This is because that at lower PCR efficiency, some molecules are less likely to be amplified and become underrepresented, causing a decrease in library complexity and correspondingly higher fractions of PCR duplicates.

The other three parameters—the number of initial molecules, the number of molecules sequenced (i.e., sequencing depth), and the number of PCR cycles—are often adjusted to meet specific experimental conditions. Our simulations revealed that a change in PCR cycle number alone only minimally affected the fraction of PCR duplicates (the blue line in the top-left panel of Fig. 4 is nearly flat around the dashed vertical line), because the starting molecules of the original pool are proportionally propagated to the final library [38]. In contrast, decreasing the number of initial molecules or increasing the number of molecules sequenced sharply raised the frequency of PCR duplicates (Fig. 4, two top-right panels).

We further tested these findings using experimental datasets. We first analyzed a set of five UMI RNA-seq libraries made with gradually decreasing amounts of starting RNA and correspondingly increasing numbers of PCR cycles: 4 μg (8 cycles), 2 μg (9 cycles), 1 μg (10 cycles), 500 ng (11 cycles), 125 ng (13 cycles) (Additional file 1). We observed that less starting RNA and correspondingly more PCR amplification resulted in higher fractions of PCR duplicates (Fig. 6a). For example, the 125 ng, 13-cycle library yielded 10.7% (median over 43,432 genes) PCR duplicates, while the 4 μg, 8-cycle library made by the same procedure contained only 1.79% PCR duplicates. Similarly, analysis of UMI small RNA-seq libraries generated from 39 ng (30 cycles) to 5 μg (16 cycles) total RNA (Additional file 1) revealed that starting with less RNA caused higher fractions of PCR duplicates (Fig. 6a).

Fig. 6
figure 6

Fraction of PCR duplicates across genes for (a) a series of UMI RNA-seq and small RNA-seq libraries made with different amount of starting materials, and (b) a series of UMI small RNA-seq libraries all made with 5 μg of total mouse testis RNA and with an increasing number of PCR cycles

Simulations argue that the increase in PCR duplicates is not a consequence of greater PCR amplification but rather is caused by the use of lower starting material. To test this idea, we analyzed a second set of nine UMI small RNA-seq libraries, all generated from 5 μg total RNA from the same mouse testis, but amplified using 14 to 30 PCR cycles (Additional file 1). Consistent with the simulations, these libraries did not show a discernable trend between fraction of PCR duplicates and the number of PCR cycles (Fig. 6b). Thus, the higher fraction of PCR duplicates observed in libraries made from low amounts of RNA followed by high PCR cycle numbers more likely reflects the reduced complexity of the starting pool, rather than the increased number of PCR cycles. Together, our simulated and experimental data demonstrate that less starting RNA or higher sequencing depth, but not more PCR cycles per se, accounts for the frequency of PCR duplicates. Nonetheless, our recommendation remains to optimize PCR cycle numbers during library preparation because several other artifacts are known to be associated with excessive PCR amplification, such as decreasing amplification efficiency as PCR reaction proceeds [39], biased amplification of sequences with different GC content [40], or chimeric PCR products due to cross hybridization [41, 42]. Moreover, when the starting material has low complexity, as for single-cell sequencing, excessive PCR amplification is likely to compound all such problems.

Discussion

The two most widely used computational tools for PCR duplicate removal, Picard MarkDuplicates (http://broadinstitute.github.io/picard/) and SAMtools rmdup [3] rely only on the mapping coordinates of sequencing reads. Our data suggest that most identical reads reflect biological reality. Thus, removing PCR duplicate reads using only mapping coordinates erroneously eliminates many usable reads, particularly those produced from short transcripts and small RNAs. This is consistent with an earlier study that compared RNA-seq data generated using three different library preparation methods (Smart-Seq, TruSeq and UMI-seq) [43] and reported that removing PCR duplicates without UMIs introduced bias into RNA-seq data. There are also computational methods that assess PCR duplicate rates without UMIs, e.g., using reads mapping to heterozygous variants [44] or gene expression levels [45], but they only estimate duplicate frequencies and cannot accurately identify duplicate reads. Our approach relies on UMIs and can accurately identify all PCR duplicates, regardless of gene expression and genomic variants.

The eight mouse tissues we analyzed span a range of transcriptome complexity: previous analyses showed that the mouse testis transcriptome contains ~ 18,700 autosomal protein-coding transcripts, ~ 8,600 non-coding RNAs, and ~ 31.7 Mb of intergenic RNA, while the liver transcriptome contains only ~ 15,500 autosomal protein-coding transcripts, ~ 1,000 non-coding RNAs, and ~ 7.2 Mb of intergenic RNA [29]. Among the eight mouse tissues we tested, removing duplicate reads based on only mapping coordinates eliminates many biologically meaningful reads even when the libraries were made using ample starting RNA and optimal experimental conditions. Given the anti-correlation between RNA complexity and PCR duplicate occurrence, UMIs will improve the accuracy of comparing long or small RNA abundance across different tissues or cell types. Short RNAs, such as miRNAs and piRNAs, as well as highly abundant transcripts are particularly susceptible to underestimation by the conventional mapping coordinate method of PCR duplicate removal.

We tested the importance of a key aspect of data processing required for error correction using UMIs and showed that under typical experimental conditions for bulk sequencing (Fig. 4, dotted lines; Additional file 2: Figure S1), correcting or not correcting errors in the UMI sequences has little absolute effect on PCR duplicate quantification. However, sequencing libraries made from a small number of cells or from a small amount of tissue or RNA, have become increasingly common [46], and they are more severely affected by PCR duplicates. Single-cell sequencing poses three specific challenges for PCR duplicate removal. First, it uses a limited amount of starting RNA, causing low library complexity. Second, the ongoing discovery of new species of non-coding RNAs, many poorly understood, increases the number of species being measured, requiring longer UMIs. Finally, the increasingly high sequencing depth provided by advances in technology increases both the number of species that can be detected and the background noise. Together, these three factors make PCR duplicate measurement without UMI error correction especially problematic for single-cell sequencing. Indeed, a number of protocols have been developed to incorporate UMIs for single-cell sequencing [47]. We compile all UMI methods known to us in Additional file 3, including bulk and single-cell, RNA-seq and small RNA-seq protocols.

Compared to other UMI protocols (Additional file 3), our method has several novel features. First, our UMI RNA-seq protocol fragments the RNA before random hexamer priming of cDNA synthesis. It captures both poly(A) + and poly(A) − RNA while limiting 3′-end bias. In comparison, template-switching cDNA synthesis is commonly used in single-cell sequencing protocols; it captures full length cDNAs more efficiently but is also prone to significant 3′-end bias and loss of poly(A) − RNA species due to oligo(dT) priming [14, 48, 49]. Second, our UMI RNA-seq protocol has yielded high-quality data from 125 ng starting material and our UMI small RNA-seq protocol from 39 ng. Our protocols do not require pre-amplification and perform fragmentation prior to cDNA synthesis. Consequently, these methods capture all PCR duplicates. Third, our protocol can accommodate a large number of UMIs: random sequences with 5✕2 nt for RNA-seq and 9✕2 nt for small RNA-seq. Thus, our protocol can be used to study RNA with high sequence diversity: e.g. piRNA in the germline. In our hands long stretches of random nucleotides interferes with library construction and sequencing, whereas our use and placement of UMI locators improves sequencing quality. With the increasing use of small amounts of starting materials, our protocol complements other approaches (Additional file 3).

Our UMI approach builds on well-established protocols, requiring few changes in the procedures and little additional cost. We expect UMI analysis to be particularly useful when sequencing RNAs derived from a limited number of genomic loci, such as CaptureSeq [50] and CAGE-seq [51]. Our approach can theoretically be adapted to any sequencing technique using synthetic oligonucleotide adapters. For example, chromatin immunoprecipitation sequencing (ChIP-seq) and its alternative CUT&RUN survey the genomic regions bound by proteins of interest [52, 53]. The CUT&RUN method uses a nuclease to achieve more precise chromatin cleavage than the conventional ChIP-seq procedure, which utilizes sonication to randomly shear the DNA. Consequently, CUT&RUN increases the likelihood of identical but meaningful reads. By their nature, protein-bound fragments also map to a smaller portion of genomic positions than RNA-seq reads. UMIs can improve discovery of protein binding sites by minimizing noise. Similarly, degradome sequencing profiles the 5′ ends of 3′ cleaved RNA products [54]; incorporating UMIs will enable precise quantification of cleaved RNA abundance.

Conclusions

We described experimental protocols and computational methods that, by incorporating UMIs into standard procedures, allow accurate PCR duplicate removal from RNA-seq and small RNA-seq data. Our approach increases reproducibility and decreases noise in sequencing libraries generated using a broad range of starting RNA amount and number of PCR cycles, enabling accurate quantification of the abundance of both long and short RNAs.

Methods

Animals

C57BL/6 J mice were sacrificed using CO2 and cervical dislocation and tissues were collected from mice maintained and used according to guidelines approved by the Institutional Animal Care and Use Committee of the University of Massachusetts Medical School (A-2222-17).

Ribosomal RNA depletion for RNA-seq

Total RNA was extracted from tissues using the mirVana kit (ThermoFisher Scientific, Waltham, MA, USA) following manufacturer’s instructions. The ribosomal RNA depletion method was adapted from previously published protocols for human samples [55, 56]. One hundred and eighty-six 50 nt-long DNA oligonucleotides complementary to the entire sequences of mouse 18S, 28S, 5S, and 5.8S rRNAs, and mitochondria 16S rRNA and 16S rRNA precursor were used at 0.5 μM (f.c.) for each oligonucleotide. Total mouse testis RNA was isolated with mirVana and incubated with 1 μL of the DNA oligonucleotide mixture per 1 μg, and rRNA oligonucleotide hybridization buffer (100 mM Tris-Cl pH 7.4, 200 mM NaCl) was added to make up to 10 μL. Oligonucleotide hybridization was carried out by heating the sample at 95 °C for 3 min, then slowly cooling it down (− 0.1 °C/second) to 22 °C in a thermocycler. The reaction was further incubated at 22 °C for 5 min before being placed on ice. Thermostable RNase H (Lucigen, Middleton, MA) was added (5 U/μg total RNA), and the reaction adjusted to 50 mM Tris-Cl pH 7.4, 100 mM NaCl, and 20 mM MgCl2. and incubated at 45 °C for 30 min. After DNase treatment with Turbo DNase (1 μL/μg total RNA) according to the manufacturer’s instructions, the rRNA-depleted RNA was purified using RNA Clean & Concentrator-5 (Zymo Research, Irvine, CA, USA).

RNA-seq library construction

RNA-seq library generation was similar to previously published [17], except for the use of a more effective rRNA deletion method (above) and the UMI adapters. Briefly, total RNA was randomly fragmented using heat (94 °C for 5 min) and Mg2+. First strand cDNA was generated using ribosomal-depleted, fragmented total RNA with Superscript III (Invitrogen, Carlsbad, CA, USA). cDNA then underwent dUTP incorporation, end repair, and A-tailing. Resulting cDNA was incubated with a mixture of three sets of UMI-containing adapters, each carrying a distinct consensus sequence as described in the results for adapter ligation (Fig. 1b–c). The ligated cDNA was treated with UDG (New England Biolabs, Ipswich, MA, USA) and amplified by PCR. Length distribution and quality of the resulting library was analyzed by Agilent 2100 Bioanalyzer (Agilent Genomics, Santa Clara, MA, USA). The libraries were quantified using KAPA library quantification kit (KAPA Biosystems, Wilmington, MA, USA) and sequenced using NextSeq 500 (Illumina, San Diego, CA, USA) paired-end sequencing.

Small RNA-seq library construction

Small RNA-seq library preparation was similar to previously described [24, 25] with updated modifications. Briefly, 18–35 nt mouse small RNA was purified from a 15% denaturing urea-polyacrylamide gel (National Diagnostics, Atlanta, GA, USA). Small RNA was first ligated to 3′ DNA adapters with adenylated 5′ and dideoxycytosine-blocked 3′ ends in 50 mM Tris-HCl (pH 7.5), 10 mM MgCl2, 10 mM DTT, and 50% PEG 8000 (New England Biolabs, Ipswich, MA, USA) with T4 Rnl2tr K227Q (homemade) at 25 °C for 16 h. The 3′ adapter contained UMIs in 3 nt-blocks of random nucleotides separated by pre-defined 3 nt consensus sequences (NNN-GTC-NNN-TAG-NNN, Fig. 2b). The 54–71 nt (18–35 nt small RNA + 36 nt 3′ UMI adapter) 3′ ligated product was purified from a 10% denaturing urea-polyacrylamide gel. The 3′ ligated product was then ligated to a mixed pool of equimolar amount of 5′ RNA adapters containing UMIs in 3 nt-blocks of random nucleotides and one of the two distinct consensus sequence sets (NNN-CGA-NNN-UAC-NNN and NNN-AUC-NNN-AGU-NNN) in 50 mM Tris-HCl (pH 7.8), 10 mM MgCl2, 10 mM DTT, 1 mM ATP with T4 RNA ligase (Ambion, Foster City, CA, USA) at 25 °C for 2 h. The ligated product was precipitated with ethanol, and cDNA synthesis was performed using AMV reverse transcriptase (New England Biolabs, Ipswich, MA, USA). cDNA was PCR-amplified with a common forward primer (5′–AAT GAT ACG GCG ACC ACC GAC AGG TTC AGA GTT CTA CAG TCC GA-3′) and a reverse primer containing 6 nt Illumina multiplexing barcode (5′–CAA GCA GAA GAC GGC ATA CGA GAT NNN NNN GTG ACT GGA GTT CCT TGG CAC CCG AGA ATT CCA–3′) using AccuPrime Pfx DNA polymerase (ThermoFisher, Waltham, MA, USA). Finally, the PCR product was purified from a 2% Certified Ultra Low Range agarose gel (Bio-Rad Laboratories, Hercules, CA, USA). Length distribution and quality of the resulting libraries was analyzed by Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). The libraries were quantified using the KAPA library quantification kit (KAPA Biosystems, Wilmington, MA, USA) and sequenced using NextSeq 500 single-end sequencing (Illumina, San Diego, CA, USA).

Bioinformatics

Simulation procedure was performed according to [16]. Briefly, we simulated 7 parameters: PCR amplification probability, PCR and sequencing error rates, UMI length, number of initial molecules, number of sequenced molecules, and PCR cycle numbers, by varying one parameter and keeping other parameters constant. For each combination of the 7 parameters, we performed 10,000 replicates. UMI error correction was implemented as described in [16], except that we used read sequences instead of genomic coordinates when determining PCR duplicates for small RNA-seq. We used NetworkX [57] for graph-related algorithms, and pysam (https://github.com/pysam-developers/pysam) for handling SAM/BAM files. Reads were mapped to the mouse mm10 genome as described in [58]. For each UMI RNA-seq read, the UMI locator was required to be a perfect match of GGG, TCA, or ATC. For each UMI small RNA-seq read, the 5′ UMI locator was required to match NNNCGANNNTACNNN or NNNATCNNNAGTNNN and the 3′ UMI locator was required to match NNNGTCNNNTAGNNN, allowing 1 error across all of the non-N positions. Reads not meeting these requirements were discarded. UMIs in RNA-seq data were identified using umitools reformat_fastq. Subsequently, PCR duplicates were marked using umitools mark_duplicates. UMIs in small RNA-seq were identified and removed using umitools reformat_sra_fastq. When reads were analyzed without UMIs, PCR duplicates were identified using MarkDuplicates from Picard (https://github.com/broadinstitute/picard).