NGmerge: merging paired-end reads via novel empirically-derived models of sequencing errors
Advances in Illumina DNA sequencing technology have produced longer paired-end reads that increasingly have sequence overlaps. These reads can be merged into a single read that spans the full length of the original DNA fragment, allowing for error correction and accurate determination of read coverage. Extant merging programs utilize simplistic or unverified models for the selection of bases and quality scores for the overlapping region of merged reads.
We first examined the baseline quality score - error rate relationship using sequence reads derived from PhiX. In contrast to numerous published reports, we found that the quality scores produced by Illumina were not substantially inflated above the theoretical values, once the reference genome was corrected for unreported sequence variants. The PhiX reads were then used to create empirical models of sequencing errors in overlapping regions of paired-end reads, and these models were incorporated into a novel merging program, NGmerge. We demonstrate that NGmerge corrects errors and ambiguous bases better than other merging programs, and that it assigns quality scores for merged bases that accurately reflect the error rates. Our results also show that, contrary to published analyses, the sequencing errors of paired-end reads are not independent.
We provide a free and open-source program, NGmerge, that performs better than existing read merging programs. NGmerge is available on GitHub (https://github.com/harvardinformatics/NGmerge) under the MIT License; it is written in C and supported on Linux.
KeywordsHigh-throughput sequencing Illumina paired-end sequencing Read merging sequencing errors; quality scores; PhiX
Polymerase chain reaction
Sequence Read Archive
Among the high-throughput DNA sequencing technologies, the Solexa/Illumina platform  produces the greatest quantity of sequence data in a single run . One unique attribute of this technology is its ability to generate sequence reads from both ends of a given DNA molecule. This provides many opportunities for biological interpretation; for example, one can infer the full extent of a DNA molecule without sequencing its entirety, by aligning the paired-end reads to a reference sequence.
Thus, a base with a quality score of 40 should have a 1 in 10,000 chance of being wrong. In sequence variant detection, many programs will consider only bases that achieve a minimum quality score, in order to reduce false positives , and in clinical variant detection, such as cancer diagnostics, published guidelines frequently incorporate such standards [6, 7]. However, numerous studies have shown that the raw quality scores produced by Illumina machines are inflated; that is, sequenced bases with a given quality score have higher error rates than expected from Eq. (1), especially at the high end of the scale [8, 9, 10, 11].
Advances in Illumina sequencing technology have given rise to reads of increasing length, such that the paired-end reads for a particular library may have substantial sequence overlaps. Since these overlapping regions do not represent independent sequence data, it is possible to merge the reads into a single read spanning the full length of the original DNA fragment. This merging process allows for error correction and accurate determination of read coverage, and it has become increasingly appreciated in applications ranging from targeted variant resequencing  to metabarcoding (e.g., 16S/18S rRNA studies) .
Thus, PEAR sums quality scores for matching bases. VSEARCH utilizes a more sophisticated model developed by Edgar and Flyvbjerg , but the resulting scheme for matching bases is nearly identical to that of PEAR (see Additional file 1: Figure S1). For example, a merged base created from matching bases with quality scores of 40 would be assigned a quality score of 40 by fastq-join, 80 by PEAR, and 85 by VSEARCH (ignoring the artificial caps on quality scores placed by the programs).
None of these quality score profiles has been tested empirically, despite possible shortcomings. For example, the profiles of PEAR and VSEARCH are based on Eq. (1), which, as noted above, has been demonstrated to be inaccurate with Illumina sequencing. Furthermore, both PEAR and VSEARCH were designed under the assumption that sequencing errors in the two reads are independent; that is, in the analysis above, Eq. (3) follows from Eq. (2) only if the two events (“R1 base wrong” and “R2 base wrong”) are independent. This assumption has, to our knowledge, never been verified.
In this manuscript, we first evaluate the baseline quality score - error rate relationship produced by Illumina machines using reads derived from the enterobacteria phage ΦX174 (“PhiX”). This virus’ genome was the first DNA genome to be sequenced , and a library composed of fragments of PhiX DNA is routinely added to Illumina sequencing runs as a control. In addition to the wide accessibility of PhiX reads, the sizes of the fragments in this library are such that most PhiX-derived read pairs produced by longer Illumina runs will have sufficient overlaps that can be used to create quality score profiles for merged bases (Additional file 1: Figure S2). We have done this and incorporated these profiles into a novel merging program, NGmerge (Fig. 1). We demonstrate that NGmerge corrects errors and ambiguous bases (Ns) better than other merging programs, and produces merged reads whose quality scores accurately reflect the bases’ error rates.
Baseline error rates
We began with nine Illumina sequencing runs that yielded 2 × 250 bp paired-end reads, produced at Harvard University. After identifying reads that originated from PhiX, we calculated error rates for each quality score. Consistent with previous studies [8, 9, 10, 11], the error rates were higher than expected based on Eq. (1) above; bases with quality score 40 had error rates an order of magnitude above the predicted 1 × 10− 4.
Variants in the PhiX genome
Creating quality score profiles
On the other hand, where the two reads’ bases disagreed (Fig. 3b), the combined quality scores were lowest when the two original quality scores were similar to each other, and rose above 30 only when the two quality scores were at opposite ends of the scale. This is similar to the schemes of fastq-join and VSEARCH, whereas PEAR does not reduce quality scores for mismatches (Additional file 1: Figure S1).
These two quality score models were incorporated into our merging program, NGmerge, to be utilized in the creation of merged reads.
Comparing merging programs
To compare the performance of NGmerge against other merging programs, we opted not to use the same Harvard datasets, since they had trained NGmerge’s models. Instead, we queried the Sequence Read Archive (SRA) for datasets that had sufficient read lengths and PhiX content to be usable for calculating error rates. We found 33 such datasets, containing a total of 1386 sequencing runs (see Methods and Additional file 2: Table S1 for details).
First, we determined that each of the SRA datasets had the same five PhiX genomic sequence variants previously identified (Table 1). In addition, the baseline error rates of the datasets’ reads followed a similar trend to those of the training datasets. We then processed the datasets through each of the merging programs (since neither fastq-join nor VSEARCH consider dovetailed alignments, the datasets were preprocessed by NGmerge in adapter-removal mode (Fig. 1b) prior to analysis with those programs).
In addition to errors, sequence reads sometimes contain ambiguous bases (Ns), which can also complicate downstream analyses. Though ambiguous bases comprised just ~ 0.03% of overlapping bases in the SRA datasets, NGmerge’s unique approach (counting them as neither matches nor mismatches during alignment) led to the correction of the most, twice the counts of fastq-join and VSEARCH, and 25% more than PEAR (Fig. 4b).
Where the original reads’ bases did not match (Fig. 5b), NGmerge slightly underestimated the quality scores throughout most of the score range. Fastq-join and VSEARCH followed similar paths, going above the baseline profile only at the ends of the quality score range. The merged bases produced by PEAR had far higher error rates than expected throughout the quality score range.
Of the merging programs analyzed, NGmerge has the best performance. It considers dovetailed alignments and thus does not require a separate adapter-removal step prior to merging reads; this more than compensates for its slightly worse run-time compared to VSEARCH (Additional file 1: Note S1). Furthermore, NGmerge produces lower error rates and corrects more Ns than the other programs. We note that NGmerge’s method for resolving mismatched bases may be improved by implementing a context-based scheme like that of CASPER. However, such an approach may have difficulty distinguishing sequencing errors from true biological variants in real samples; this is an area for further research.
In addition, NGmerge creates merged reads whose quality scores accurately reflect the bases’ error rates, unlike the other merging programs. It is noteworthy that the quality scores that deviate the most from the expected error rates are produced by VSEARCH and PEAR in merging matching bases (Fig. 5a). As explained above, these programs’ quality score calculations are based on the assumption of the independence of sequencing errors in paired reads [17, 20]. Our results demonstrate that this assumption is false. Therefore, the models produced by Edgar and Flyvbjerg  are invalid.
Because of our reliance on reads derived from PhiX, NGmerge’s quality score profiles were tested only on datasets generated by MiSeq and HiSeq instruments. Thus, they may not work as well with other Illumina platforms, such as the NextSeq. NGmerge provides the option to forgo its default quality score profiles and instead to utilize calculations similar to those of fastq-join (and FLASH), which, though simplistic, are conservative over most of the score ranges. A third option is for the user to supply custom matrices of quality score profiles to NGmerge.
Although a number of studies have concluded that Illumina’s quality scores are substantially inflated, our results contradict this notion. Inaccuracies in reference sequences are a persistent problem that adversely affect error rate calculations , and in fact that proved to be the case here. Once the PhiX reference genome was corrected to account for the five sequence variants, the calculated error rates closely followed the expected relationship shown in Eq. (1). It is important to note that, in general, errors occurring during the library preparation process (e.g. PCR amplification) can be misconstrued as sequencing errors, leading to specious conclusions . This is another reason why unamplified PhiX remains an enduring control in Illumina sequencing applications.
We have examined errors produced by Illumina sequencing technology via reads derived from PhiX. We have found that variants from the canonical PhiX reference genome account for most of the discrepancy between the actual and theoretical relationships between quality scores and error rates. Furthermore, in the course of developing empirical models for error rates of paired-end sequence reads, we have demonstrated the fallacy of the assumption that has been repeatedly made, both implicitly and explicitly, that errors in such reads are independent.
Finally, we have described a free and open-source program, NGmerge, that merges paired-end sequence reads, thus correcting errors and ambiguous bases, and assigning quality scores that are consistent with the measured error rates. The program can also be run in an alternative mode simply to remove contaminating sequencing adapters. Complete descriptions of the usage and options of NGmerge are found on the homepage of the software (https://github.com/harvardinformatics/NGmerge) and in the accompanying UserGuide. The program is written in C and is parallelized with OpenMP 4.0.
NGmerge operates on paired-end reads in two distinct modes, “stitch” and “adapter-removal” (Fig. 1). In either mode, NGmerge tests all possible gapless alignments of a pair of reads in attempting to find an optimal alignment. By default, NGmerge requires that a valid alignment have a minimum overlap of 20 bp and a maximum of 10% mismatches in the overlap region (-m 20 -p 0.1). If multiple valid alignments are found, the one with the lowest fraction mismatch is selected as the optimum. In all of these calculations, ambiguous bases (Ns) are considered neither matches nor mismatches. When the ‘-d’ option is set, or in adapter-removal mode, NGmerge will also attempt to align the reads in a dovetailed configuration (such as that shown in Fig. 1b), with 3′ overhangs corresponding to contaminating sequencing adapters that will be removed.
In stitch mode, NGmerge forms a single merged read that spans between the 5′ ends of the two original reads. The bases and quality scores of any non-overlapping regions are copied into the new read. For the overlapping region, if the bases of the R1 and R2 reads match, that base is used for the merged read, with the quality score determined from the “match” matrix (see “Harvard datasets” below). Where the bases disagree, the base with the higher quality score is selected, and the “mismatch” matrix yields the merged quality score.
NGmerge (v0.2) was run with default alignment parameters, requiring a minimum overlap of 20 bp and a maximum of 10% mismatches. With the SRA datasets, it was also run allowing 20% mismatches (-p 0.2). The ‘-d’ option was set to allow for dovetailed alignments and the automatic removal of sequencing adapters.
Fastq-join (v1.01.759)  was run with alignment parameters analogous to those of the defaults of NGmerge (-m 20 -p 10). Because fastq-join does not allow for dovetailed alignments, adapters were removed from the reads with NGmerge prior to analysis with fastq-join.
VSEARCH (v2.6.2) , like fastq-join, does not consider dovetailed alignments, so it was also given reads from which adapters were removed with NGmerge. The minimum overlap length was increased to 20 bp (--fastq_minovlen 20). The maximum number of mismatches was greatly increased (--fastq_maxdiffs 30); even so, VSEARCH still analyzed the fewest reads (Additional file 2: Table S1). The cap on output quality scores was increased from the default value of 41 (--fastq_qmaxout 85).
PEAR (v0.9.10)  was run with a 20 bp minimum overlap (-v 20) and a maximum p-value threshold of 0.0001 (there is no fraction mismatch parameter). The cap on output quality scores was increased from the default value of 40 (-c 80).
With NGmerge, the arguments ‘-j <file> -b’ were specified so that the program would produce a file listing overlap mismatches and Ns, for later error counting. With the other merging programs, a custom Python script (findDiffs.py) reconstructed the alignments and determined the overlap mismatches.
Further details of these programs’ approaches toward read merging, along with an illustrative example, are provided in Additional file 1: Note S2.
Calculation of error rates
The 5386-bp genome of the enterobacteria phage ΦX174, sensu lato, was retrieved from NCBI (accession NC_001422.1). The reads of each of the datasets were aligned to this genome using Bowtie2 , as described below. Pileup files were created from the alignment files using SAMtools (v1.5) mpileup (-B -Q 0 -d 1e9) , and variants were called with VarScan (v2.4.1) pileup2snp (--min-var-freq 0.15) .
The downloaded PhiX genome was modified to incorporate the five variants observed in the datasets (Table 1). Furthermore, because the PhiX genome is circular, a fragment corresponding to the first 1 kb (including the variants at positions 587 and 833) was appended to the end of the genome. This produced a final reference genome of 6386 bp that was used in all further analyses.
Reads were aligned to the modified PhiX reference genome using Bowtie2 (v2.3.2). The parameters of the program were modified to increase the strictness of accepted alignments, specifically by increasing the minimum score threshold (--score-min L,0,-0.2) and increasing gap penalties (--rdg 5,15 --rfg 5,15). The size of allowed fragments was increased to 1 kb (-X 1000), and the effort parameters were adjusted (--very-sensitive).
For the analyses of unmerged paired-end reads, contaminating sequencing adapters were removed by NGmerge (‘-a’ mode) prior to alignment. Only properly-paired alignments (‘samtools view -f 0x2’) were used to calculate error rates.
Error rates were calculated by quality score for the alignments in the SAM alignment files by a custom Python script (countErrors.py). When analyzing SAM files of merged reads, the script was provided the original read length(s) and a list of merging mismatches and Ns, in order to further categorize the errors based on the nucleotides in the R1 and R2 original reads, into matches, mismatches, and Ns (Additional file 1: Figure S3). The list of mismatches and Ns was produced by NGmerge or findDiffs.py, as described above.
In order to create the error profiles of NGmerge, we used the script countErrors3D.py, which tallied errors based on both original reads’ quality scores.
The three custom Python scripts are freely available on GitHub (https://github.com/jsh58/NGmerge/tree/master/scripts/).
Nine sequencing datasets produced by the Bauer Core Facility of Harvard University, Faculty of Arts and Sciences Division of Science, between January 2016 and May 2017 were analyzed. Each sequencing run was produced on the Illumina HiSeq 2500 platform, yielding 2 × 250 bp paired-end reads. The reads placed into the “undetermined” bins were examined, a total of 553.0 million read pairs.
The paired-end reads were aligned to the modified PhiX genome after adapter-trimming with NGmerge, as described above. A LOESS regression function relating the quality scores to the logarithm (base 10) of the error rates was calculated in R (v3.4.1). This formed the baseline error profile for subsequent analyses.
To create the quality score profiles of NGmerge, the same reads were processed with NGmerge in stitch mode, allowing dovetailed alignments (-d). The merged reads were aligned to PhiX, and error rates were calculated for each combination of the quality scores of the R1 and R2 reads with countErrors3D.py. The “match” table, consisting of error rates for the locations where the bases of the two reads agreed, was edited to exclude values derived from fewer than 1000 counts, and values of zero were given a pseudo-error count of 0.5 (yielding a rate of 0.5/base count). These same edits were made to the “mismatch” table (error rates where the bases of the two reads disagreed), except that the minimum count threshold was lowered to 100 because of the reduced number of counts. Then, for each table, a two-dimensional LOESS regression function (relating both quality scores to the log (base 10) of the error rates) and predicted error rates were calculated in R. These error rates were then transformed back into quality scores using the baseline error profile calculated for the original paired-end reads. The resulting “match” and “mismatch” matrices were incorporated into NGmerge as the default quality score profiles.
The Sequence Read Archive (SRA) of NCBI (https://www.ncbi.nlm.nih.gov/sra) was queried for datasets containing paired-end reads that were minimum 2 × 250 bp in length. The sequencing runs of over 160 SRA studies were examined, though some were eliminated immediately for various reasons (mislabeled as paired-end; actual read lengths shorter than stated; reads already trimmed). The remaining datasets’ reads were adapter-trimmed with NGmerge and aligned to the PhiX genome. Those with at least 10,000 read pairs aligning to PhiX in a properly-paired configuration were further analyzed. The details of these 33 datasets, which contained 1386 sequencing runs and a total of 2.25 billion read pairs, are provided in Additional file 2: Table S1.
The reads of the 33 SRA datasets were analyzed in a similar fashion to the Harvard datasets. The baseline error rates were calculated from the original reads, and error rates were also determined after processing the reads with each of the merging programs. For each set of error rates, LOESS regression functions were computed, relating the quality scores to the log (base 10) of the error rates.
The author thanks Dr. Timothy Sackton of Harvard University and Dr. W. Kelley Thomas of the University of New Hampshire for helpful discussions regarding software design and testing. Dr. Thomas and the Hubbard Genome Center also provided preliminary datasets for testing. The author thanks Harvard University, George Washington University, and Dr. Jeremy Goecks for funding support. The computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University.
This work was not supported by any specific funding.
JMG wrote the software, analyzed the data, and wrote the paper. The author read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The author declares that he has no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 12.Chiara M, Pavesi G. Evaluation of quality assessment protocols for high throughput genome resequencing data. Front Genet. 2017;8:94.Google Scholar
- 16.Rognes T, Flouri T, Nichols B, Quince C, Mahé FVSEARCH: a versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.Google Scholar
- 24.Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9.Google Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.