Background

The launch of the Roche 454 sequencer [1] opened the era of next-generation sequencing (NGS). Compared with the traditional Sanger sequencing technology [2], NGS has significantly larger throughput and lower per-base cost. Taking these advantages, researchers can analyze the information of all the genes in one project, rather than doing gene-by-gene studies. Researchers can obtain the information on the whole genome, protein-coding exons, or other specified regions by performing whole genome sequencing (WGS), whole exome sequencing (WES) [3] or target region sequencing (TRS), respectively. As an easy to interpret, known functional impacts, and relatively low-cost technology comparing with WGS, WES is widely used in human genetics research nowadays.

In the short history of NGS era, five major sequencing platforms have emerged: Roche 454, Illumina Hiseq series (GA, Hiseq, Miseq, X) [4], SOLiD [5], Complete Genomics [6], and Ion Torrent [7]. These platforms use different mechanics and have their specific advantages and disadvantages [8]. After years of technology evolution and competition, Hiseq becomes the most widely used sequencing platform. In 2015, BGI and Complete Genomics jointly announced a new next-generation sequencer, BGISEQ-500 [9]. However, its performance on WES has not yet been well evaluated by the scientific community.

We evaluated the performance of BGISEQ-500 on WES by parallel comparison with Hiseq 4000 on the well-characterized human sample NA12878. We compared the concordance of variation detected between the sequencing platforms, and their variation detection accuracy with the reference variation dataset from the genome in a bottle project (GIAB) [10]. We found that BGISEQ-500 has comparable reproducibility and competitive variation detection accuracy to Hiseq 4000.

Results

Data production

The DNA of NA12878 was used in this study. Agilent SureSelect Kit v5 (50.4 Mb) was used for exome capture. The sequencing strategy was pair-end 150 bp for Hiseq4000 and pair-end 100 bp for BGISEQ-500. The DNA was sequenced to >100x on both Hiseq 4000 and BGISEQ-500 platform. (Methods) Each platform sequenced four replicable libraries, resulting in eight datasets in total. For comparison, each dataset was down-sampled to 100x. BGISEQ showed higher exome capture efficiency (faction of bases on target, BGISEQ 72% vs. Hiseq 58%, fraction of reads on target, BGISEQ 83% vs. Hiseq 71%), therefore it requires less sequencing data to reach the same sequencing depth. BGISEQ also showed slightly lower duplication rate than Hiseq (7.0% vs 7.6%). For each dataset, > 99.6% bases of the target region are covered by at least one read, and > 96% bases of the target region are covered with > = 20 reads, indicating that the whole target region is comprehensively and uniformly captured on both platforms (Table 1, Fig. 1).

Table 1 Data production
Fig. 1
figure 1

Cumulative depth distribution. The cumulative frequency is the fraction of target regions covered by given depth or higher

Variation detection

The variation detection was processed under the guidelines from the Genome Analysis Toolkit (see Methods for details) [11, 12]. Only the target region was used for variation detection. Roughly 41 thousand single nucleotide variants (SNVs) were detected from each dataset, including 19 thousand inside the protein coding region, and ~ 9.4 thousand could lead to protein change (Table 2). The BGISEQ datasets generated slightly fewer SNVs than Hiseq datasets. About 99.7% of detected SNVs could be found in dbSNP142 [13]. The transition/transversion ratio (Ti/Tv) on whole target region and on the exonic regions is 2.56 and 3.09, and the corresponding heterozygous to homozygous variation ratio (het/hom) is 1.64 and 1.52, respectively. (Table 2) These metrics from our datasets are comparable to other sources [10, 14].

Table 2 Variation detection and annotation

Roughly 3.5 thousand insertion/deletions (indels) were detected, approximately 470 out of which lie on coding regions. BGISEQ detected slightly more heterozygous indels than Hiseq, resulting in higher het/hom ratio (1.46 vs. 1.31). (Table 2) Around 95% of indels have been previously reported in dbSNP142. The indels from these datasets also showed similar length distribution (Fig. 2).

Fig. 2
figure 2

Indel length distribution. Deletions are shown as negative length whereas insertions are shown as positive. The fraction of insertions and deletions sum up to 1 separately. All datasets showed similar length distribution

Variation concordance

It has been noticed that the repetitive sequence in the genome could lead to ambiguity of short fragment alignment, which subsequently leads to false variation detection results. This could be a major cause of the SNV detection errors. [15] Using the genome mappability score [15], ~ 2.3% of the target region was identified with alignment uncertainty. These regions were eliminated, and only the mappable regions were used hereafter.

The SNVs and indels from these four datasets were compared against each other separately, and Jaccard similarity was used to measure the concordance between datasets. It is showed that SNV results have 97.6% intra-platform concordance and 96.7% inter-platform concordance, and BGISEQ has slightly higher intra-platform concordance than Hiseq. (Figure 3) The high intra-platform concordance indicated qualified reproducibility of each platform. Moreover, BGISEQ has excellent inter-platform concordance with Hiseq, suggesting that BGISEQ could substitute Hiseq in many application fields where SNVs are the primary focus.

Fig. 3
figure 3

Concordance of variation detection. The Jaccard similarity for variation detection results from datasets was calculated for SNV (top-left triangle) and indel (bottom-right triangle) separately. SNV detection showed excellent intra- and inter-platform concordance, while indel detection showed inferior concordance. The inter-platform concordance is slightly lower than intra-platform concordance

For indel, the intra-platform concordance is 82.3% for BGISEQ and 83.6% for Hiseq, and the inter-platform concordance is 81.7%. Indels with exact the same position and alternate alleles were considered as concordant, regardless of their genotypes. This concordance level is lower than SNV’s, as expected, because it is harder to detect indels from short reads using current methods [16]. It is possible that there could be different concordance metrics for indels, depending on how position and genotype concordance is defined. When only the location is considered, regardless of the genotypes, the intra-platform concordance increased to 87.7% for BGISEQ and 87.4% for Hiseq, and the inter-platform concordance increased to 85.1%. On the other hand, if we restrict concordance sites as exact genotype match, the intra-platform concordance decreased to 79.4% for BGISEQ and 81.3% for Hiseq, and the inter-platform concordance decreased to 76.5%. These results suggest that different datasets and platforms have better agreement on indel location, but diverge on zygosity and genotypes.

Variation accuracy

As the sequenced sample NA12878 had been well characterized by genome in the bottle project (GIAB) [10], the genotype result from GIAB was used as the reference to benchmark variants accuracy. Only the high confidence regions from the GIAB dataset was used for evaluation. Sensitivity and precision were used during the evaluation. Variation detected by both test data and GIAB were considered as true positive if they have exactly the same positions, regardless of whether they have the same genotypes. Overlapping indel positions were not considered as the same one unless they have the same start and end positions.

Regarding SNV detection, 35,210 SNVs were found in the GIAB dataset, and the sensitivity and precision from datasets is 99.0 and 99.4%, respectively. Both platforms have excellent SNV detection accuracy. Furthermore, within 196 false positive and 356 false negative sites per sample, 126 (64%) and 204(57%) are concordant in all 8 samples, respectively, indicating errors in the reference set or systemic exome sequencing bias. Taking this into consideration, the actual accuracy from these platforms should be higher, and the difference between datasets on these loci is minimal.

For indel detection, 2650 indels are found in the GIAB dataset, and the sensitivity is 93.8 and 92.6%, and the precision 97.2 and 98.0% for BGISEQ and Hiseq, respectively. BGISEQ showed higher sensitivity but lower precision than Hiseq. Thirteen false positive and 42 false negative sites are concordant across samples, contributing to 22.9 and 23.6% total false positives and false negatives on average. To achieve better indel detection performance is still challenging for exome sequencing and GATK pipeline.

Impact of sequence depth and read length

To analyze the sequence depth effects to variation detection, the raw data were down-sampled to various sequence depths (20x, 30x, 50x, 70x, 100x, 150x). The SNV detection sensitivity increases with increased sequence depth, and it plateaus after the sequence depth exceeds 100x (Fig. 4). The increase in sensitivity may be due to better coverage of the target with increased sequence depth. On the other hand, the SNV detection precision stays constant while sequence depth increased, showing that the model has reached its limit when the depth is greater than 20x. For indel detection, the sensitivity increases while sequence depth increasing, as expected. But it does not reach a plateau even when the sequence depth is as high as 150x, showing that additional data is required for a better indel detection.

Fig. 4
figure 4

Variation detection accuracy versus sequence depth. Raw data were down-sampled to 20x, 30x, 50x, 70x, 100x, and 150x to generate this plot. Variations on the high confidence regions from the genome in the bottle project were used as the reference

By truncating read length to 100 bp, four additional Hiseq pair-end 100 bp (PE100) datasets were generated from Hiseq datasets respectively, with the same sequence depth used in the above evaluation. Compared to Hiseq PE150 datasets, PE100 datasets showed similar precision and sensitivity on SNV detection. On indel detection, PE150 and PE100 have similar precision while PE100 showed slightly higher sensitivity (Table 3). The result suggests that the read length has no significant impact on variation detection accuracy at this sequencing depth level.

Table 3 Variation accuracy estimation by comparison with GIAB

Discussion

The reproducibility of a method and its consistency with other available methods are essential for its application in academic and clinical scenarios. The aim of this study was to evaluate these characteristics of BGISEQ-500 in WES studies. Considering that WES is widely used in human genetics research and approaching clinical use, the validation of this newly established platform is crucial.

By comparison with Hiseq4000 data, this study showed BGISEQ could achieve comparable coverage to Hiseq for exome capture and sequencing procedure. For variation detection, both platforms have high and comparable reproducibility, although the reproducibility of indels is lower than of SNVs. Furthermore, the inter-platform concordance is commensurate with intra-platform concordance on SNV detection. This indicates that BGISEQ WES is capable of the applications which Hiseq WES SNV data has been tested and verified for.

For data usage, BGISEQ showed higher exome capture efficiency, and requires about 25% less data than Hiseq to reach the same sequencing depth. If the sequencing cost per gigabase is comparable between the two platforms, BGISEQ will have a lower cost per sample to obtain the same amount of effective data. Furthermore, smaller dataset also means less computational resources and runtime for bioinformatics analysis.

Unlike former Complete Genomics sequencers, the raw data generated from BGISEQ-500 is in fastq format, the de facto standard format for NGS data. As a result, the data is acceptable to most of the commonly used analysis software. This allows scientists to manipulate the data by themselves, to adapt the cutting edge analysis methods, and to compare results with other data easily. As an illustration, the commonly used bwa-GATK pipeline was applied to both Hiseq and BGISEQ data seamlessly in this study, keeping the results clear of biases possibly introduced by different analysis software.

It is important to note that BGISEQ is still in its rapidly evolving phase and the amount of data generated from BGISEQ is limited. Therefore, the results showed in this study could be limited by its sample size and the current state of the instruments. The full evaluation and validation of this platform requires more data from research and clinical scenarios.

Conclusion

In this study, we evaluated the performance of BGISEQ-500 on WES and compared it with Hiseq4000 WES and GIAB high-confidence variation dataset. BGISEQ showed high reproducibility and concordance in intra- and inter- sequencing platforms. Both BGISEQ and Hiseq platforms demonstrated adequate variation detection accuracy on the benchmark region. These results suggest that BGISEQ-500 is a qualified sequencing platform for WES.

Methods

Data production

The DNA of NA12878 was acquired from Coriell Institute (Catalog ID NA12878). Agilent SureSelect Kit v5 was used for exome capture. The library construction and sequencing procedure on BGISEQ-500 were as described in BGISEQ sequencing section, with a ~ 170 bp insert size and pair-end 100 bp sequencing strategy. The procedure on Hiseq4000 followed the manufacturer instructions with a 250~300 bp insert size and pair-end 150 bp sequencing strategy. The DNA was sequenced to >100x on both platforms. Each platform sequenced four replicable libraries. Each dataset was randomly down-sampled to 100x for comparison.

BGISEQ sequencing

DNA preparation

1 μg DNA (Qubit quantified) was sheared by Covaris and double selected with Ampure XP beads to acquire fragments around 170 bp. End repairing, A-tailing and Ad153 index adapter ligation of 50 ng size-selected DNA (0.6*0.8 Ampure XP beads double-selection) were performed in a single tube for a total time of 1.75 h, followed by a purification with 50 μL of Ampure XP beads and 20 μL of TE buffer. After PRE-PCR following the 95 °C 3 min, (98 °C 20s, 60 °C 15 s, 72 °C 30s) 8 cycles, 72 °C 10 min,4 °C hold thermal cycles using KAPA HiFi Hot Start Ready Mix, the product was purified with 1X Ampure XP beads and quantified with Qubit BR ds DNA kit.

Exome capture

The hybridization of BGISEQ-500 library was performed according to the Agilent SureSelect protocol with the following optimized parameters: 1000 ng purified DNA was used for hybridization, the index block and PCR block of Agilent were replaced by a corresponding Ad153_index block (one for all indexes) and Ad153_PCR block. KAPA HiFi Hot Start Ready Mix was used for Post-PCR following the 95 °C 3 min, (98 °C 20s, 60 °C 15 s, 72 °C 30s) 13 cycles, 72 °C 10 min, 4 °C hold thermal cycle. Post-PCR products were purified with 1X Ampure XP beads and quantified with Qubit BR ds DNA kit. The fragment size distribution was analyzed using the Agilent 2100 Bioanalyzer and DNA 1000 kit.

Single strand DNA (ssDNA) circle construction

300 ng Post-PCR products were denatured at 95 °C for 3 min (with heated lid at 105 °C) and transferred to 4 °C quickly to make a single strand DNA circle (ssDNA circle). After heat denaturation, the splint oligo binds to the adapters on both ends of a single Post-PCR product, guiding both ends of the single strand to adjacent positions. The following ligation reaction using T4 DNA ligase at 37 °C for 30 min helps connect the adjacent bases on different ends of the single strand with a phosphodiester bond to complete the circularization. An enzyme digestion using Exo I and Exo III at 37 °C, 30 min was implemented to eliminate uncirculated DNA. The libraries were purified with 168 μL of Ampure XP beads and quantified with Qubit BR ssDNA kit. The resulting ssDNA circle is the final library.

Make DNA nanoballs (DNBs)

DNBs were generated from the ssDNA circle using rolling circle amplification (RCA) to enhance the fluorescent signals in the sequencing process [6]. Primer mix bind to 6 ng ssDNA circles at 95 °C for 1 min, 65 °C for 1 min, and 40 °C for 1 min. 40 μL Phi29 DNA polymerase and 4 μL SSB was added for RCA reaction at 30 °C for 30 min, then quickly transferred to 4 °C. The reaction was ended completely by 20 μl stop buffer, generating even-sized DNBs which will have similar fluorescent intensity in the sequencing process to ensure the signal chastity. Compared to PCR amplification, RCA has no PCR error accumulation and no PCR bias because the original ssDNA circle is the only template during the entire amplification process [6]. Unlike emulsion or bridge PCR, the rolling-circle amplification does not require precise titration of template concentrations in situ and circumvents stochastic inefficiencies.

Loading and sequencing

The DNBs were combined with 1/4 volume of DNB loading buffer and an appropriate amount of PBS buffer to a total volume of 140 μL, and placed on the loader machine. The DNBs were loaded onto the flow cell in which DNB binding sites are patterned nano-arrays. Sequencing data were generated with pair-end 100 bp sequencing strategy on the BGISEQ-500 platform.

Variation detection

The variation detection proceeded under the guidelines from Genome Analysis Toolkit (GATK) [11, 12]. Reads were aligned to human reference genome hg19 using bwa-mem (version 0.7.15) [17] with default parameters. The bam files were sorted, merged and library duplications were identified using Picard (https://github.com/broadinstitute/picard, version 2.5.0). After that, GATK (version 3.3) was applied to refine reads around indels, and recalibrate base quality. Variation calling on the capture region was carried out by GATK HaplotypeCaller with ‘--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000’. Additionally, because the BGISEQ library construction protocol introduced more PCR cycles, we used ‘-pcrModel AGGRESSIVE’ for BGISEQ datasets. The gvcfs were then genotyped by GenotypeGVCFs with ‘-stand_call_conf 30 -stand_emit_conf 10 -allSites’. Raw SNVs and indels were extracted by SelectVariants separately. To obtain high quality variants, hard filter was applied using GATK VariantFiltration but with separate criteria for each platform. We found that the GATK recommendation criteria worked well for illumina data (‘QD < 2.0 || FS > 60.0 || MQ <40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0’ for SNVs; and ‘QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0’ for indels), but it looks poor for the BGISEQ data. This is reasonable because these recommendation were specific based on Illumina data, and it is reasonable to assume that BGISEQ data have different characteristics. A different criteria was tuned following the hard-filter tuning scheme from GATK (https://software.broadinstitute.org/gatk/documentation/article.php?id=6925) based on additional inhouse BGISEQ data, and applied on BGISEQ data (‘QD < 2.0 || FS > 60.0 || ReadPosRankSum < -8.0’ for SNVs; and ‘QD < 4.0 || FS > 200.0 || ReadPosRankSum < -8.0’ for indels). The final call-sets were then annotated by SnpEff (version: 4.0) [18].

Variation concordance analysis

Regions with genome mappability scores (GMS) [15] less than 1 were excluded from the evaluation. Jaccard similarity (number of sites where both datasets detected as SNV divided by the number of sites where at least one dataset is detected as SNV) was used to measure the concordance between datasets.

Variation accuracy estimation

The genotype result from GIAB was used as the reference to benchmark variation accuracy. Only the GIAB high confidence regions with GMS equal to 1 was used. During the evaluation, precision (true positive/(true positive + false positive)) instead of specificity (true negative/(true negative + false positive)) was used because true negative dominated the dataset. Variation loci detected in both test data and GIAB were considered as true positive.