SEQdata-BEACON: a comprehensive database of sequencing performance and statistical tools for performance evaluation and yield simulation in BGISEQ-500
The sequencing platform BGISEQ-500 is based on DNBSEQ technology and provides high throughput at low cost. This sequencer has been widely used in various areas of scientific and clinical research. A better understanding of the sequencing process and performance of this system is essential for stabilizing the sequencing process, accurately interpreting sequencing results and efficiently solving sequencing problems. To address these concerns, a comprehensive database, SEQdata-BEACON, was constructed to accumulate run performance data from BGISEQ-500.
A total of 60 BGISEQ-500 instruments in the BGI-Wuhan lab were used to collect sequencing performance data. Lanes run in paired-end 100 (PE100) mode with a 10 bp barcode were selected, and each lane was assigned a unique entry number as its identification number (ID). From November 2018 to April 2019, 2236 entries were recorded in the database, containing 65 metrics about sample, yield, quality, machine state and supplies information. Using a correlation matrix, 52 numerical metrics were clustered into three groups signifying yield-quality, machine state and sequencing calibration. The distributions of the metrics also revealed patterns and provided clues for further explanation or analysis of the sequencing process. Using data from all 200 cycles, a linear regression model simulated the final output well. Moreover, the final yield could be predicted as early as the 15th cycle of sequencing; the R² values of the 200th- and 15th-cycle models were 0.97 and 0.81, respectively. The model was run with test sets obtained in May 2019 to predict the yield, which resulted in an R² of 0.96. These results indicate that our simulation model was reliable and effective.
Data sources, statistical findings and application tools provide a constantly updated reference for BGISEQ-500 users to comprehensively understand DNBSEQ technology, solve sequencing problems and optimize run performance. These resources are available on our website http://seqBEACON.genomics.cn:443/home.html.
Keywords: BGISEQ-500; Sequencing run performance; Prediction tools; Data analysis; Database
accGRR: Accumulated Good Reads Rate
BGI: Beijing Genomics Institute
BIC: Basecall information content
cPAS: Combinatorial Probe-Anchor Synthesis
FC: Flow cell identifier
NGS: Next generation sequencing
RCA: Rolling circle amplification
scCAT-seq: Single-cell chromatin accessibility and transcriptome sequencing
scRNA-seq: Single cell RNA sequencing
SNP: Single nucleotide polymorphism
SNR: Signal to Noise Ratio
TotalEsr: Total Effective Spot Rate
WES: Whole exome sequencing
WGS: Whole genome sequencing
Next generation sequencing (NGS, also known as high-throughput sequencing) has led us into the genomic era. In the past 15 years, the development of sequencing technology has focused mainly on reducing cost and improving throughput, accuracy and read length. Currently, the sequencer manufacturers Illumina and Beijing Genomics Institute (BGI) provide high throughput and accuracy, while Pacific Biosciences and Oxford Nanopore offer long read lengths [1, 2]. The first BGI sequencer, BGISEQ-500, was launched in 2015 (http://en.mgitech.cn/) and is based on two key technologies: DNA nanoball (DNB) and Combinatorial Probe-Anchor Synthesis (cPAS). In library preparation, DNA is fragmented, end-repaired and ligated with adapters. The ligation product is amplified by PCR for several cycles to produce DNA libraries. These libraries are then circularized by DNA ligase with a splint oligo that is reverse-complementary to one strand. Next, the DNA library circles are replicated with phi29 polymerase by rolling circle amplification (RCA) to form single-stranded DNA molecules called DNBs. The DNBs are dispersed and immobilized on a photolithographically etched, patterned flow cell by a loader machine. During sequencing, a probe is first annealed to a DNA molecular anchor on the DNB. In each cycle, DNA polymerase incorporates one base labeled with a fluorescent group, and the four-color light signals emitted by the bases are collected by a high-resolution imaging system and converted into bases by basecalling.
Recently, the BGISEQ-500 platform has been widely used in a variety of sequencing applications, such as whole genome sequencing (WGS), whole exome sequencing (WES), RNA-seq, small RNA and metagenomics [7, 8]. The BGISEQ-500 sequencing platform has been used not only in the transcriptome analysis of plant nitrogen metabolism but also in human clinical applications for cancer genome sequencing and TP53 mutation detection in high-grade serous ovarian cancer [10, 11]. In addition, BGISEQ-500 has been reported to perform well at single-cell resolution, such as in scRNA-seq and scCAT-seq [12, 13, 14]. Compared with the Illumina platforms, the BGISEQ-500 is cost-effective and has low error and duplication rates [1, 13]. The throughput per lane of BGISEQ-500 is approximately twice that of the HiSeq4000 in the PE100 sequencing type. The cost per gigabase (Gb) is 40–60% of that of the Illumina HiSeq4000 platform. The generation of DNBs is based on rolling circle amplification, which effectively prevents errors from PCR amplification. The sequencing data derived from BGISEQ-500 are compatible with widely used bioinformatics tools and pipelines such as GATK, BWA, HISAT, DESeq2 and SnpEff. The only difference is the setting of some software parameters. To date, the results obtained from data generated on Illumina and BGISEQ-500 platforms have been comparable. For example, BGISEQ-500 demonstrates comparable SNP detection accuracy in WGS, similar consistency in variation detection in WES, and high concordance in transcriptome and metagenomic studies [16, 17]. Therefore, DNBSEQ technology provides a new choice for resolving issues in scientific research and in agricultural, environmental and clinical applications.
The performance of sequencers is very important for the throughput, quality and reliability of the data generated in each run. Among the sequencing performance metrics in BGISEQ-500, yield (e.g., Reads, the number of DNBs recognized by the Basecall software) and quality (e.g., Q30, the percentage of bases with an error rate below 0.001) are of the highest concern. Other metrics regarding chemical reaction and instrument state are also recorded in the sequencing summary. However, we still lack a profound understanding of these metrics, especially the connections between them. As one of the world’s largest sequencing service providers, BGI performs thousands of sequencing runs each year. Alongside the enormous amounts of nucleotide data, the massive sequencing performance data are highly valuable for illustrating this complicated process and for troubleshooting. Unfortunately, datasets or databases of this type are presently rare, and a comprehensive database is required to integrate the abundant run performance data.
In this study, we designed a database, SEQdata-BEACON, to comprehensively collect sequencing performance data from BGISEQ-500, including sample, yield, quality, machine state and supplies. We calculated Pearson’s correlation coefficients for 52 numerical metrics for hierarchical clustering and analyzed their distribution patterns. We also used linear regression to establish yield simulation models to investigate the connections among yield-correlated metrics and attempted to predict the final yield at the early stage of sequencing. All the data and statistical analysis results are available on our open access website. These resources can be used as a continually updated reference dataset for BGISEQ-500 users in different enterprises or schools to gain a deeper understanding of DNBSEQ technology.
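The Q30 definition above follows from the standard Phred relation Q = -10·log10(p), under which a quality score of 30 corresponds to an error probability of 0.001. The short sketch below illustrates that relation on made-up quality scores; the function names are hypothetical and this is not the Basecall software's own computation.

```python
def phred_to_error(q):
    """Error probability implied by a Phred quality score q."""
    return 10 ** (-q / 10)


def q30_percent(quals):
    """Percentage of bases whose Phred score is at least 30."""
    return 100.0 * sum(1 for q in quals if q >= 30) / len(quals)


# Made-up per-base quality scores, for illustration only.
example_quals = [20, 30, 35, 40]
example_q30 = q30_percent(example_quals)  # 3 of the 4 bases reach Q30
```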
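As a rough illustration of the correlation-and-clustering step, the sketch below computes Pearson's r between metrics and groups them by agglomerative clustering. The distance (1 - r), the average linkage, the metric names and the values are all assumptions made for this example; the text does not state which distance or linkage the authors used.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_metrics(metrics, k):
    """Average-linkage agglomerative clustering on distance 1 - r (assumed)."""
    names = list(metrics)
    dist = {(a, b): 1.0 - pearson(metrics[a], metrics[b])
            for a in names for b in names}
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


# Hypothetical metrics with made-up values: a/b move together, as do c/d.
toy = {
    "metric_a": [1, 2, 3, 4, 5],
    "metric_b": [2.1, 3.9, 6.2, 8.0, 9.8],
    "metric_c": [5, 3, 4, 1, 2],
    "metric_d": [10, 6.5, 8, 2, 4.1],
}
groups = cluster_metrics(toy, k=2)
```

Cutting the tree at k groups mirrors how the 52 metrics were reported as three groups, but the choice of k here is only for the toy data.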
Data collection and database construction
DNA libraries were loaded onto a patterned array. After successive chemical reactions, signal acquisition and basecalling, the sequencers normally generate a series of folders and files in each cycle or at the end of the sequencing process. In BGISEQ-500, more than 10 files record the run configuration data of the entire sequencing process, such as InputInfo_*.txt, RunInfo.txt, summaryReport.html, fovReport.QC.txt, BarcodeStat.txt and fq.fqStat.txt. We chose 60 BGISEQ-500 sequencers in the BGI-Wuhan lab and collected all available files generated following chemical reactions and basecalling. The criteria for selecting metrics from these files were as follows: 1) metrics of great concern and closely related to the sequencing process; 2) metrics covering information on the sequencing type, the optical path state of the machine, etc.; 3) metrics related to traceable information for troubleshooting. Based on these criteria, we used the flow cell identifier (FC) as an index to extract 64 metrics from the data sources and create the database ‘SEQdata-BEACON’. Each entry was assigned a unique identification number (ID). We accumulated lanes in paired-end 100 (PE100) sequencing since it is the major sequencing type in BGISEQ-500. No detailed sample information was entered into the database in order to protect the privacy of customers. The database was constructed on a MySQL server (version 8.0).
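The entry layout described above (one row per lane, a unique ID, and the flow cell identifier used as an index) can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for the MySQL server so that it runs without setup, and the table name, column names and values are hypothetical rather than the actual SEQdata-BEACON schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL 8.0 server
conn.execute(
    """
    CREATE TABLE run_entry (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique entry ID
        fc      TEXT NOT NULL,                      -- flow cell identifier
        lane    TEXT NOT NULL,
        reads_m REAL,                               -- hypothetical yield column
        q30_pct REAL,                               -- hypothetical quality column
        UNIQUE (fc, lane)
    )
    """
)
conn.execute("CREATE INDEX idx_run_entry_fc ON run_entry (fc)")


def add_entry(connection, metrics):
    """Insert one lane's metrics and return the ID assigned to the entry."""
    cur = connection.execute(
        "INSERT INTO run_entry (fc, lane, reads_m, q30_pct) "
        "VALUES (:fc, :lane, :reads_m, :q30_pct)",
        metrics,
    )
    return cur.lastrowid


# Made-up flow cell IDs and values, for illustration only.
first_id = add_entry(conn, {"fc": "FC_EXAMPLE_1", "lane": "L01",
                            "reads_m": 700.0, "q30_pct": 90.1})
second_id = add_entry(conn, {"fc": "FC_EXAMPLE_1", "lane": "L02",
                             "reads_m": 680.0, "q30_pct": 88.4})
lanes_on_fc = conn.execute(
    "SELECT COUNT(*) FROM run_entry WHERE fc = ?", ("FC_EXAMPLE_1",)
).fetchone()[0]
```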
Web visual interface and statistical analysis
A user-friendly interface for SEQdata-BEACON was built on Apache (version 2.4.33 win64 VC15 server). The Google Chrome web browser (version 68.0.3440.106) is recommended for accessing the website. Statistical analyses and figures in this study were generated from data in our database using R (version 3.5.0, x64), with additional packages installed as needed.
Yield simulation model
Y is the value of the dependent variable yield. T is the value corresponding to the current cycle in TotalEsr. D is the value of Dnbnumber at the beginning of the sequencing. B, G, S and F are the average values of BIC, accGRR, SNR and FIT, respectively, for the first 200 cycles. ε is the observed error and obeys a normal distribution. The model parameters βn are the coefficient values estimated using a regression model.
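The equation that these definitions describe is not reproduced in this text. Read from the definitions alone, the 200-cycle model would be a multiple linear regression of roughly the following form; the ordering and numbering of the coefficients β0 to β6 is an assumption:

```latex
Y = \beta_0 + \beta_1 T + \beta_2 D + \beta_3 B + \beta_4 G + \beta_5 S + \beta_6 F + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)
```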
Yi is the value of the dependent variable yield. Ti is the value corresponding to the current cycle i in TotalEsr. Di is the value of Dnbnumber at the beginning of the sequencing. Bi, Gi, Si and Fi are the average values of BIC, accGRR, SNR and FIT, respectively, for the first i cycles. εi is the observed error in cycle i and obeys a normal distribution. The model parameters βn,i are the coefficient values estimated using a regression model. The coefficient of determination (R²), which ranges from 0 to 1, was used to measure the accuracy of our model; a value closer to 1 means better performance of the model.
The linear regression model was evaluated with test sets to assess the reliability of the prediction model. The test sets were entered into the model formula to obtain prediction results, and an R² for the test sets was used to evaluate the constructed model. Linear regression, the backward elimination method of stepwise regression, and the evaluation were all conducted in R.
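As with the 200-cycle model, the per-cycle equation is not reproduced in this text. A form consistent with the definitions above, together with the usual definition of the coefficient of determination, would be the following. The coefficient numbering and the lane index j (running over the n lanes used for fitting, with fitted values ŷ and mean ȳ) are assumptions introduced here:

```latex
Y_i = \beta_{0,i} + \beta_{1,i} T_i + \beta_{2,i} D_i + \beta_{3,i} B_i
      + \beta_{4,i} G_i + \beta_{5,i} S_i + \beta_{6,i} F_i + \varepsilon_i,
\qquad
R^2 = 1 - \frac{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2}{\sum_{j=1}^{n} (y_j - \bar{y})^2}
```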
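The fit, backward elimination and test-set evaluation described above can be sketched in a few dozen lines. The authors worked in R; the Python version below is only an illustration on synthetic data, and it assumes an AIC-based elimination criterion, which the text does not specify. All column meanings and numeric values are made up.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, rows):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]


def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def select(rows, cols):
    return [[r[c] for c in cols] for r in rows]


def aic(rows, y, cols):
    """AIC of the OLS fit on the given columns (assumed selection criterion)."""
    sub = select(rows, cols)
    rss = sum((a - b) ** 2 for a, b in zip(y, predict(fit_ols(sub, y), sub)))
    return len(y) * math.log(rss / len(y)) + 2 * (len(cols) + 1)


def backward_eliminate(rows, y, cols):
    """Drop one predictor at a time for as long as that lowers the AIC."""
    cols, best = list(cols), aic(rows, y, cols)
    while len(cols) > 1:
        score, drop = min((aic(rows, y, [c for c in cols if c != d]), d)
                          for d in cols)
        if score >= best:
            break
        best, cols = score, [c for c in cols if c != drop]
    return cols


random.seed(0)
# Synthetic lanes: columns mimic [TotalEsr, Dnbnumber, an unrelated metric].
rows = [[random.uniform(0.7, 0.9), random.uniform(600, 800), random.gauss(0, 1)]
        for _ in range(200)]
y = [10 + 100 * t + 0.8 * d + random.gauss(0, 2) for t, d, _ in rows]
train_x, test_x, train_y, test_y = rows[:150], rows[150:], y[:150], y[150:]

kept = backward_eliminate(train_x, train_y, [0, 1, 2])
beta = fit_ols(select(train_x, kept), train_y)
test_r2 = r_squared(test_y, predict(beta, select(test_x, kept)))
```

Computing R² on lanes held out from fitting, as in the last line, corresponds to the test-set evaluation described in the text.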
Data collection and database construction
Web visual interface
Statistical findings: metric features
Statistical findings: yield simulation model
Model summaries of linear regressions for predicting yield outputs
Analytical results for Eq. (3)
Analytical results for Eq. (4)
Analytical results for Eq. (5)
Continuous improvement in DNBSEQ technology has introduced more efficient sequencing platforms, such as MGISEQ-2000 and DNBSEQ-T7. Compared with the Illumina platform, BGISEQ-500 is inexpensive and has a low sequencing error rate. Sequencing costs are often affected by geographic, institutional, personnel and reagent costs, and continue to decline as technology is updated. According to recent research statistics, taking the PE100 sequencing type as an example and not considering the physical loss of the sequencer, the cost per Gb for the BGISEQ-500 is half that of the HiSeq4000 platform. In addition, BGISEQ-500 sequencing data showed a lower error rate than Illumina data (< 0.1% and 0.1%, respectively). Moreover, the sequencing data produced by BGISEQ-500 are compatible with widely used bioinformatics tools and pipelines, and the data analysis results showed comparable accuracy and reproducibility. Recent investigations have reported that MGISEQ-2000 has single nucleotide polymorphism (SNP) detection accuracy in WGS and gene detection in scRNA-seq comparable to those of Illumina platforms [19, 20].
Stable performance of sequencers in massively parallel sequencing guarantees the utility of the data and the cost efficiency of each run. Some tools have been reported to evaluate sequencing run performance by analyzing sequencing data quality. For example, FastQC is a commonly used tool for quality control of sequencing data and the generation of a comprehensive QC report. It can also be incorporated into an analysis pipeline to present the quality of raw data in easy-to-browse HTML reports. The whole NGS workflow includes library and template preparation, enrichment, sequencing and data analysis, but quality control (QC) checkpoints for sequencing performance are often placed at the data quality check rather than in the sequencing process. Unlike data quality evaluation, run performance metrics provide a wealth of information that can be used to assess the sequencing process and its results. To gain insight into the sequencing performance of BGISEQ-500, we established the first reported BGISEQ-500 sequencing performance database and website to comprehensively collect performance data.
There were 2236 entries with 65 metrics containing information on sample, yield, quality, machine state and supplies in ‘SEQdata-BEACON’. Collecting metric values automatically from sequencing configuration files reduces manual labor, saves time and improves data accuracy. The run data we collected covered libraries from most species and the major types of sequencing applications. At present, on our 60 BGISEQ-500 sequencers in the BGI-Wuhan lab, PE100 sequencing with a 10 bp barcode is suitable for WGS, WES and RNA-seq, and the libraries are derived from DNA or RNA samples of plants, animals, microbes and humans. In the Q30 versus Reads scatterplot, 90.6% of the lanes had more than 650 M reads and Q30 values above 85%, which shows that BGISEQ-500 was stable and reliable in massively parallel sequencing. Therefore, without the risk of index hopping, DNBSEQ can generate excellent sequencing data with fewer duplications and errors and has extensive application in population-scale sequencing projects, such as the 10KP (10,000 Plants) Genome Sequencing Project. To study the correlation of yield-associated metrics, we used the backward elimination method of stepwise regression and established a yield simulation model with an R² of 0.97. The model produced a good simulation, which suggests that TotalEsr, Dnbnumber, BIC, SNR and FIT contributed to the yield. To predict the final production, we used all six parameters to construct 40 prediction models using stepwise regression. The residual deviation of the models showed that the final yield could be predicted as early as the 15th cycle of sequencing, and the small changes in the residuals within read1 and read2 implied little fluctuation of the metrics during the sequencing process.
The linear regression model is a common statistical technique for simulating the associations between variables, but we cannot rule out that other methods may produce better simulation results. Furthermore, we plan to investigate quality-associated metrics and establish a quality simulation model. Combined with the yield simulation model, these two models may effectively simulate the sequencing results and suggest further ways to improve sequencing performance.
Recently, the sequencer manufacturer Illumina introduced a new service named “Proactive Instrument Monitoring”, a proactive support service that involves remote instrument monitoring in real time. By sending instrument performance data to Illumina, the support team can monitor the instrument and resolve issues more quickly. Beyond monitoring instrument performance, our study focuses on data accumulation and aims to explore data patterns by statistical analysis and to interpret sequencing results. In the future, we plan to include more sequencing platforms built on DNBSEQ technology, which will provide an integrated performance reference for BGI sequencers and help users fully understand this series of instruments. Moreover, we also want to add the PacBio Sequel II and Oxford Nanopore PromethION sequencers to obtain a deeper understanding of single-molecule sequencing technology. We expect SEQdata-BEACON to be a comprehensive platform: with data accumulation, it can demonstrate the actual performance of the sequencing platforms; by developing more data-mining applications, it can enrich functional tools such as QC metrics models and metrics standards; by presenting data and statistical results on the website, it can also give users useful optimization and troubleshooting suggestions to solve their problems.
Widespread application of NGS has resulted in a large amount of data, including nucleotide sequences and sequencing process performance. We designed a database, SEQdata-BEACON, to accumulate run performance data from BGISEQ-500 containing 65 metrics with information on sample, yield, quality, machine state and supplies. A correlation matrix of 52 numerical metrics was clustered into three groups: yield-quality, machine state and sequencing calibration. The distributions of the numerical metrics showed characteristic features and provided clues for further interpreting and analyzing these metrics. We also constructed linear regression models to accurately simulate the final yield using metric values in the 200th and 15th cycles of the runs. The data sources, statistical findings and application tools are all available on our website (http://seqBEACON.genomics.cn:443/home.html), which can help BGISEQ-500 users from enterprises or schools understand DNBSEQ technology and interpret their sequencing results.
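The pass-rate figure quoted above comes from applying two thresholds to each lane. A minimal sketch of that calculation, using the 650 M reads and 85% Q30 cut-offs from the text but entirely made-up lane values and a hypothetical function name, might look like this:

```python
def pass_rate(lanes, min_reads_m=650.0, min_q30=85.0):
    """Percentage of lanes exceeding both the reads and the Q30 threshold."""
    passed = sum(1 for reads_m, q30 in lanes
                 if reads_m > min_reads_m and q30 > min_q30)
    return 100.0 * passed / len(lanes)


# Made-up (reads in millions, Q30 in percent) pairs, for illustration only.
example_lanes = [(700.0, 90.1), (640.0, 88.0), (720.0, 84.0), (680.0, 86.5)]
example_rate = pass_rate(example_lanes)  # two of the four lanes pass both cut-offs
```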
We would like to acknowledge the ongoing contributions and support of all our BGI employees.
We thank Zetao Bai (Oil Crops Research Institute, Chinese Academy of Agriculture Sciences, Wuhan, China) for her assistance editing this manuscript.
CL designed and constructed the database, and developed the web application; YZ collected data, performed data examination and analysis, and prepared figures; RZ prepared tables, performed data and application examination, and wrote the manuscript; AL, BH, LL, LC and BL contributed materials and analysis tools; JH and ZT conceived and designed the experiments, and approved the final draft of the manuscript submitted for review and publication. All the authors have read and approved the manuscript.
No funding agency has funded this work.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 13. Natarajan KN, Miao Z, Jiang M, Huang X, Zhou H, Xie J, et al. Comparative analysis of sequencing technologies for single-cell transcriptomics. Genome Biol. 2019;20(1). https://doi.org/10.1186/s13059-019-1676-5.
- 14. Zhao Y, Li X, Zhao W, Wang J, Yu J, Wan Z, et al. Single-cell transcriptomic landscape of nucleated cells in umbilical cord blood. Gigascience. 2019;8(5). https://doi.org/10.1093/gigascience/giz047.
- 18. Wang O, Chin R, Cheng X, Wu KYM, Mao Q, Tang J, et al. Efficient and unique co-barcoding of second-generation sequencing reads from long DNA molecules enabling cost effective and accurate sequencing, haplotyping, and de novo assembly. Genome Res. 2019. https://doi.org/10.1101/gr.245126.118.
- 19. Gorbachev A, Kulemin N, Naumov V, Belova V, Kwon D, Rebrikov D, et al. Comparative analysis of novel MGISEQ-2000 sequencing platform vs Illumina HiSeq 2500 for whole-genome sequencing. BioRxiv. 2019. https://doi.org/10.1101/577080.
- 20. Senabouth A, Anderson S, Shi Q, Shi L, Jiang F, Zhang W, et al. Comparative performance of the BGI and Illumina sequencing technology for single-cell RNA-sequencing. BioRxiv. 2019. https://doi.org/10.1101/552588.
- 21. Andrews S. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 18 Nov 2018.
- 26. Illumina Proactive Instrument Monitoring. https://www.illumina.com/services/instrument-services-training/product-support-services/instrument-monitoring.html. Accessed 20 May 2019.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.