Optimal sequencing depth design for whole genome re-sequencing in pigs
As whole-genome sequencing is becoming a routine technique, it is important to identify a cost-effective depth of sequencing for such studies. However, the relationship between sequencing depth and biological results from the aspects of whole-genome coverage, variant discovery power and the quality of variants is unclear, especially in pigs. We sequenced the genomes of three Yorkshire boars at an approximately 20X depth on the Illumina HiSeq X Ten platform and downloaded whole-genome sequencing data for three Duroc and three Landrace pigs with an approximately 20X depth for each individual. Then, we downsampled the deep genome data by extracting eleven different proportions (0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9) of paired reads from the original bam files. Together with the original data, this mimicked sequence data of the same individuals at twelve sequencing depths of 1.09X, 2.18X, 3.26X, 4.35X, 6.53X, 8.70X, 10.88X, 13.05X, 15.22X, 17.40X, 19.57X and 21.75X, which we used to evaluate genome coverage, the variant discovery rate and genotyping accuracy as a function of sequencing depth. In addition, SNP chip data for Yorkshire pigs were used as a validation for the comparison of single-sample calling and multisample calling algorithms.
Our results indicated that 10X is an ideal practical depth for achieving plateau coverage and discovering accurate variants, as it achieved greater than 99% genome coverage. The number of false-positive variants increased dramatically at depths of less than 4X, the depth at which 95% of the whole genome was covered. In addition, the comparison of multi- and single-sample calling showed that multisample calling was more sensitive than single-sample calling, especially at lower depths. The number of variants discovered under multisample calling was 13-fold and 2-fold higher than that under single-sample calling at 1X and 22X, respectively. A large difference was observed when the depth was less than 4.38X. However, more false-positive variants were detected under multisample calling.
Our research will inform important study design decisions regarding whole-genome sequencing depth. Our results will be helpful for choosing the appropriate depth to achieve the same power for studies performed under limited budgets.
Keywords: Genome coverage, Sequencing depth, Pig, Whole-genome sequencing
Abbreviations
- GATK: The Genome Analysis Toolkit
- NGS: Next Generation Sequencing
- ORF: Open Reading Frame
- SNP: Single Nucleotide Polymorphism
- Ti/Tv ratio: Transition/transversion ratio
Sequencing technologies have been widely used in many fields, such as human medicine, evolutionary biology [2, 3, 4, 5], microbial ecology, agriculture and animal breeding. In sequencing, a key consideration is the sequencing depth, which is defined as the ratio of the total number of bases obtained by sequencing to the size of the genome or, equivalently, the average number of times each base in the genome is sequenced. Sequencing cost, the main concern in practice, is mainly influenced by the sequencing depth, sequencing technology and sample size. Although the sequencing cost for a particular sample has decreased significantly in recent years, it is still a great burden for large-scale applications.
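The definition of sequencing depth given above can be written as a one-line calculation. The following is a minimal sketch; the function name and the example numbers (read count, read length and genome size) are hypothetical and are not taken from this study.

```python
def sequencing_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total number of sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


# Hypothetical illustration only: 360 million 150-bp reads
# over an assumed genome size of 2.5 Gb.
depth = sequencing_depth(n_reads=360_000_000, read_length=150,
                         genome_size=2_500_000_000)
print(f"{depth:.1f}X")
```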
Sequencing depth has a great impact not only on sequencing cost but also on the biological results of sequencing data processing, e.g., the genomic assembly completeness and accuracy of a de novo assembly , the number of detected genes and expression levels in RNA-Seq , the proportion of rare variants and SNVs detected , and the accuracy of SNP calling and genotyping in whole-genome sequencing . Therefore, it is particularly important to investigate sequencing depth to achieve a higher accuracy at a lower cost and to identify trade-offs between sequencing data quality and quantity.
Recently, there have been many studies on the impact of sequencing depth in RNA-Seq that have aimed to find the optimal sequencing depth for either de novo transcriptome assembly through the comparison of different nonmodel animals or the study of gene expression in RNA-Seq. In addition, similar plant studies have been carried out to investigate the optimal transcriptome coverage in Hevea brasiliensis. For DNA sequencing, the research on sequencing depth has mainly focused on de novo genome assembly [10, 16] and genetic association studies of complex traits [17, 18]. Additionally, several studies have explored the recommended coverage for reducing indel calling errors and detecting copy number variations in sequencing data. The impact of sequencing depth on single-cell sequencing has also been explored recently [21, 22]. However, most of the studies on sequencing depth conducted thus far have focused on simulated data [18, 23], real datasets from humans, or model organisms with small genomes, such as E. coli, S. kudriavzevii and C. elegans. A limitation of simulated data is that mismapping around short indels cannot be taken into account, so these data cannot fully mimic the real situation in sequencing. Such investigations have seldom been conducted on animals with large genomes, such as pigs, and this research is useful not only for humans but also for livestock and other mammals.
In this study, we sequenced the whole genomes of three Yorkshire boars at a sequencing depth of approximately 20X. In addition, whole-genome sequencing data from three Landrace and three Duroc boars were downloaded from the NCBI SRA database to explore the relationship between sequencing depth and biological results from the aspects of whole-genome coverage, the variant discovery power and the quality of variants.
Animal ethics statement
Necessary permission was obtained from the owner of the farm for collecting the samples and for their use in the subsequent study. All animal management and sample collection procedures strictly followed the protocol approved by the Institutional Animal Care and Use Ethics Committee (IACUC) at the China Agricultural University, which specifically approved this study (Permit Number: DK996).
Blood samples from three Yorkshire boars were collected from a breeding farm in Beijing. Genomic DNA was extracted from the blood samples by using a TIANamp Blood DNA spin kit (DP348; Tiangen, Beijing) following the manufacturer’s protocol. The quality of all DNA samples was evaluated by agarose gel electrophoresis, and accurate quantification of DNA concentrations was conducted with a Qubit 2.0 fluorometer. Whole-genome sequencing was performed using the Illumina HiSeq X Ten platform according to the manufacturer’s standard protocols and produced 150-bp paired-end reads in fastq format. Whole-genome sequencing data for three Duroc and three Landrace pigs in sra format were downloaded from the NCBI SRA database and then converted to fastq format by fastq-dump in the SRA Toolkit. All individuals that we selected were unrelated to each other.
Sequencing quality control and NGS data processing
To avoid reads with artificial bias, quality control was conducted by using the NGS QC Toolkit. First, IlluQC.pl with the default parameters was used to remove reads that contained more than 30% low-quality (quality value ≤20) bases. Second, TrimmingReads.pl was used to trim the 3′ end of fragments. Then, high-quality paired-end reads were mapped to the pig reference genome sequence (Sscrofa11.1 http://hgdownload.soe.ucsc.edu/goldenPath/susScr11/bigZips/) by the BWA (Burrows-Wheeler Aligner) with the command ‘bwa mem -M -R’.
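The read-level filtering rule described above (discard a read when more than 30% of its bases have a quality value ≤20) can be illustrated as follows. This is a sketch of the stated rule only, not the NGS QC Toolkit's own implementation; the function name is hypothetical, Phred+33 quality encoding is assumed, and the handling of read pairs is not modelled.

```python
def passes_quality_filter(quality_string: str,
                          min_quality: int = 20,
                          max_low_fraction: float = 0.30,
                          phred_offset: int = 33) -> bool:
    """Return True if at most `max_low_fraction` of bases have quality <= `min_quality`.

    Assumes Phred+33 encoded FASTQ quality strings (an assumption, not stated in the text).
    """
    if not quality_string:
        return False
    low = sum(1 for ch in quality_string if ord(ch) - phred_offset <= min_quality)
    return low / len(quality_string) <= max_low_fraction


# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
print(passes_quality_filter("IIIIIII555"))  # 30% low-quality bases: kept
print(passes_quality_filter("IIIIII5555"))  # 40% low-quality bases: removed
```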
SNP calling and filtering
The Genome Analysis Toolkit (GATK) (version: 3.7) was used to call SNPs, following GATK best practices, in which realignment and recalibration were included. SNP calling and genotyping were performed by UnifiedGenotyper in GATK. Only SNPs on autosomes were used for the following analysis. Before SNP calling, Picard SortSam and Picard MarkDuplicates (http://broadinstitute.github.io/picard/) were used to sort the alignments and mark potential PCR duplicates, respectively. After SNP calling, hard filters were used to remove potential false-positive SNPs and InDels. For SNPs, the following criteria were used for filtering, as suggested by the GATK documentation: “QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0”. For InDels, the criteria “QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0” were used for filtering, also as suggested by the GATK documentation. Both single-sample calling and multisample calling (three samples from each breed) implemented in the GATK UnifiedGenotyper were used in our analysis.
Principal component analysis (PCA) was performed via GCTA (a tool for Genome-wide Complex Trait Analysis) [28, 29], including all SNPs common to all individuals after filtering by minor allele frequency (MAF ≥ 0.05) and LD pruning (--indep-pairwise 100 50 0.5). The genomic relationships between individuals were plotted as a heatmap in R.
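To make the logic of the hard-filter expressions concrete, the sketch below evaluates them on a dictionary of per-variant annotations. It is an illustration of the thresholds quoted above, not GATK code; the function and variable names are hypothetical, and treating a missing annotation as "does not trigger the filter" is an assumption.

```python
from typing import List, Mapping, Tuple

Filter = Tuple[str, str, float]

# (annotation, comparison, threshold) triples copied from the expressions in the text.
SNP_FILTERS: List[Filter] = [
    ("QD", "<", 2.0), ("FS", ">", 60.0), ("MQ", "<", 40.0),
    ("HaplotypeScore", ">", 13.0), ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
]
INDEL_FILTERS: List[Filter] = [
    ("QD", "<", 2.0), ("FS", ">", 200.0), ("ReadPosRankSum", "<", -20.0),
]


def fails_hard_filter(info: Mapping[str, float], filters: List[Filter]) -> bool:
    """Return True if any clause of the OR-joined filter expression is met."""
    for key, op, threshold in filters:
        value = info.get(key)
        if value is None:
            # Assumption: a missing annotation does not trigger the clause.
            continue
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            return True
    return False


# Hypothetical annotation values for one variant.
variant = {"QD": 12.3, "FS": 100.0, "MQ": 59.0, "ReadPosRankSum": 0.4}
print(fails_hard_filter(variant, SNP_FILTERS))    # FS > 60.0, so filtered as a SNP
print(fails_hard_filter(variant, INDEL_FILTERS))  # FS <= 200.0, so kept as an InDel
```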
Construction of samples with different sequencing depths
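The downsampling proportions and depths reported in the abstract are related by a simple product: each proportion of the original ~21.75X data yields the corresponding expected depth. The sketch below shows this arithmetic together with a seeded, pair-wise random selection; the function names are hypothetical, and the selection routine is only an illustration of sampling read pairs as units, not the tool actually used in this study.

```python
import random
from typing import List, Sequence

FULL_DEPTH = 21.75  # mean depth of the original data, as reported in the text
FRACTIONS = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def expected_depth(fraction: float, full_depth: float = FULL_DEPTH) -> float:
    """Expected depth after keeping `fraction` of the read pairs."""
    return fraction * full_depth


def subsample_pairs(pair_names: Sequence[str], fraction: float, seed: int = 0) -> List[str]:
    """Keep each read pair (both mates together) with probability `fraction`."""
    rng = random.Random(seed)
    return [name for name in pair_names if rng.random() < fraction]


for f in FRACTIONS:
    print(f"{f:.2f} -> {expected_depth(f):.2f}X")
```

Sampling by pair name rather than by individual read keeps both mates of a pair together, which matters for paired-end alignment statistics.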
ANNOVAR  was used to annotate the variants with the dbSNP database. Following Ai et al. , those variants that met one of the following criteria were defined as potential loss-of-function (LoF) variants: (1) a SNP or small Indel within a coding region causing a frameshift of the open reading frame (ORF); (2) a SNP or small indel at a splice site; and (3) a SNP or small indel within a coding region resulting in a stopgain or stoploss.
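The three LoF criteria listed above can be expressed as a small classification rule. In the sketch below, the region and consequence labels ("exonic", "splicing", "frameshift insertion", "frameshift deletion", "stopgain", "stoploss") are assumed to follow ANNOVAR-style gene annotation output; the function name and the exact label set are assumptions for illustration and may need adjusting to the annotation files actually produced.

```python
# Assumed consequence labels for criteria (1) and (3).
LOF_CONSEQUENCES = {
    "frameshift insertion", "frameshift deletion",  # (1) frameshift of the ORF
    "stopgain", "stoploss",                         # (3) stop gained or lost
}


def is_potential_lof(region: str, consequence: str) -> bool:
    """Return True if a variant meets one of the three LoF criteria.

    `region` may hold several labels joined by ';' (for example 'exonic;splicing').
    """
    regions = set(region.split(";"))
    if "splicing" in regions:  # (2) variant at a splice site
        return True
    return "exonic" in regions and consequence in LOF_CONSEQUENCES


print(is_potential_lof("exonic", "stopgain"))
print(is_potential_lof("exonic", "synonymous SNV"))
print(is_potential_lof("splicing", "."))
```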
Comparison of data with different depths
After variant calling and filtering, we compared data with different depths from the following aspects: (1) whole-genome coverage; (2) the number of SNPs discovered; (3) the discovery power for variants; and (4) the quality of variants evaluated by the novel rate and transition/transversion ratio.
The discovery power for each subsample is the ratio of variants in downsampled data to that in the original deep sequencing data.
The novel rate represents the proportion of variants not available in the dbSNP database to the total variants, as we defined the variants that were not included in the dbSNP database as novel variants. Here, we used the pig dbSNP database (Build ID: 150).
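The two definitions above reduce to simple ratios. The following sketch assumes variants are represented as hashable identifiers such as (chromosome, position, ref, alt) tuples; the function names and the toy data are hypothetical.

```python
from typing import Hashable, Iterable, Set


def discovery_power(n_downsampled: int, n_full_depth: int) -> float:
    """Variants found in the downsampled data / variants found at full depth."""
    if n_full_depth <= 0:
        raise ValueError("n_full_depth must be positive")
    return n_downsampled / n_full_depth


def novel_rate(variants: Iterable[Hashable], known: Set[Hashable]) -> float:
    """Fraction of called variants that are absent from the known-variant set."""
    called = set(variants)
    if not called:
        raise ValueError("no variants supplied")
    return len(called - known) / len(called)


# Toy example with made-up variants.
dbsnp = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")}
calls = [("1", 100, "A", "G"), ("1", 200, "C", "T"),
         ("2", 50, "G", "A"), ("3", 10, "T", "C")]
print(novel_rate(calls, dbsnp))                # 1 of 4 calls is not in the known set
print(discovery_power(len(calls), 5))          # 4 of 5 full-depth variants recovered
```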
The transition/transversion ratio (Ti/Tv ratio) is the ratio of the number of variants observed as transitions (between purines, or between pyrimidines) to the number observed as transversions (between purines and pyrimidines). The Ti/Tv ratio is particularly useful for assessing the quality of single nucleotide polymorphisms inferred from sequencing data [31, 32]. A higher ratio generally indicates higher accuracy.
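As an illustration of how the Ti/Tv ratio is obtained from a list of biallelic SNPs, consider the minimal sketch below. The function names are hypothetical; in this study the statistic was taken from GATK VariantEval rather than computed this way.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    if ref == alt or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid SNP: {ref}>{alt}")
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Number of transitions divided by number of transversions."""
    ti = tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


# Four transitions and two transversions.
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(snps))
```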
GATK DepthOfCoverage was used to compute the whole-genome coverage, and GATK VariantEval and R  were used to extract summary statistics from the results and for further analysis.
To further evaluate the accuracy of SNP calling, all samples of Yorkshire boars were genotyped with the PorcineSNP80K BeadChip (Illumina, San Diego, CA, USA), including 68,528 SNPs covering the whole genome. Genotype quality control was carried out with PLINK1.9, in which SNPs with call rates less than 95% and nonchromosomal SNPs were excluded. The sites shared between each subdepth dataset and the beadchip dataset were counted and defined as common sites on the beadchip. For single-sample calling, the discordance rate for each sample and genotype caller was defined as the number of genotypes that mismatched the chip genotype divided by the number of common sites for the sample between the PorcineSNP80K BeadChip and the whole-genome sequencing data processed by the genotype caller. For comparison between single-sample calling and multisample calling, the discordance rate for multisample calling was defined as the proportion of genotype-discordant sites among the common sites of the sample between the PorcineSNP80K BeadChip and the whole-genome sequencing SNPs containing at least one nonreference allele.
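The discordance rate, read as the number of mismatching genotypes divided by the number of sites shared between the chip and the sequencing calls, can be sketched as follows. The function name, the site identifiers and the two-letter genotype encoding are hypothetical; the restriction of multisample calls to sites with at least one nonreference allele is assumed to be applied before the data reach this function.

```python
from typing import Mapping


def discordance_rate(chip: Mapping[str, str], seq: Mapping[str, str]) -> float:
    """Mismatching genotypes / common sites between chip and sequencing calls.

    Genotypes are unphased two-letter strings (for example 'AG'), so allele
    order is ignored when comparing.
    """
    common = chip.keys() & seq.keys()
    if not common:
        raise ValueError("no common sites between the two datasets")
    mismatches = sum(1 for site in common if sorted(chip[site]) != sorted(seq[site]))
    return mismatches / len(common)


# Toy genotypes: two shared sites, one of which disagrees.
chip_calls = {"snp1": "AG", "snp2": "CC", "snp3": "TT"}
seq_calls = {"snp1": "GA", "snp2": "CT", "snp4": "AA"}
print(discordance_rate(chip_calls, seq_calls))
```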
Summary of whole-genome sequencing and subsampling
[Table: Summary statistics for the whole-genome sequencing data]
[Table: Summary statistics of variants discovered for three pig breeds, Duroc, Landrace and Yorkshire (columns include % variants in dbSNP)]
[Table: Summary statistics of variants discovered for each individual (columns include % variants in dbSNP)]
The impact of depth on the coverage
Discovery of variants and quality of variants
The accuracy of SNP calling and genotyping was evaluated based on the novel rate and the transition/transversion ratio, as illustrated in Fig. 5c and Fig. 5d. The novel rate decreased with increasing sequencing depth (Fig. 5c), showing that the greater the depth of sequencing, the greater the number of reliable variants discovered. However, large differences were observed among Duroc, Yorkshire and Landrace. The sharpest decrease in the novel rate occurred in Yorkshire, where it fell from 16.99% to 13.45% when the depth increased from 1.10X to 4.38X and then remained basically unchanged at ~13.2% when the depth was greater than 4.38X, which may indicate that more false-positive variants were discovered when the depth was less than 4.38X. In contrast, for Duroc and Landrace, the decline in the novel rate was slow when the depth was increased from 1X to 22X.
The Ti/Tv ratios for each depth in all samples ranged from 1.99 to 2.34, as shown in Fig. 5d; Yorkshires presented higher Ti/Tv ratios than Durocs and Landraces, and the Ti/Tv ratios for Durocs and Landraces were similar. In general, the variation of the Ti/Tv ratio was not large; when the sequencing depth was increased from ~1X to ~21X, the ranges of the Ti/Tv ratio were 2.00~2.21, 2.01~2.21, and 2.31~2.30 for Duroc, Landrace and Yorkshire, respectively. Only the Ti/Tv ratios for Duroc and Landrace increased with depth, indicating higher accuracy of SNP calling. However, the Ti/Tv ratio basically remained unchanged for Yorkshire.
Figure 7b shows the discordance rate with the SNP panel for Yorkshire under two scenarios of multisample and single-sample calling strategies. As expected, the discordance rate decreased with increasing depth. The two SNP calling strategies showed different performances. For single-sample calling, the discordance rate decreased slightly to 3.22% from 4.48% when the depth was increased from 1.10X to 21.88X, while multisample calling yielded a much higher discordance rate of 27.16% at a depth of 1.10X, which sharply decreased with increasing depth, then stabilized, with the discordance rate reaching 2.96% at a depth of 21.88X.
In this study, we aimed to provide a comprehensive understanding of the relationship between the quantity of sequencing data and the quality of SNP calling and genotyping in pig whole-genome sequencing. Three popular pig breeds, Duroc, Yorkshire and Landrace, were examined, and three boars from each breed were sequenced at approximately 20X depth. To the best of our knowledge, there have been no similar studies in other livestock, such as cattle or poultry. Our findings can therefore serve as a general guide for researchers to choose an optimal sequencing depth. We extracted paired reads randomly from the original bam files at proportions of 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9, which, together with the original data, mimicked twelve different depths of genome sequencing data. Our results showed that the higher the depth of sequencing, the more novel variants were found, and the rate of false-positive variants increased dramatically when the depth was lower than ~4X, especially when the depth was less than 2X. Additionally, the genome coverage of sequencing increased with depth, and a sequencing depth of 10X achieved 99% genome coverage (Fig. 4) and ensured high-quality genotyping for pigs (Fig. 5c; Fig. 5d; Fig. 7b). According to our findings, a depth of 10X was not only the point at which saturation was achieved for the function of coverage with depth (Fig. 4) but also the point at which the increase ratio for the total variants discovered slowed (Fig. 5b) and a plateau of discordance with the beadchip data was observed for multisample calling (Fig. 7b). More than 95% of the genome was covered at 4.35X, which is consistent with other investigations. Rashkin et al. reported that 5-10X was a sufficient sequencing depth to detect common variants and identify associations for a fixed sequencing capacity in simulated data and human datasets.
Although sequencing with greater depth can offer more information, many studies have indicated that low-coverage sequencing of large samples is much more cost-effective and powerful than the deep sequencing of fewer individuals. Li et al. reported that when the frequency of variants is greater than 0.2%, only approximately 20% of the effort is needed to sequence 3000 individuals at 4X depth while achieving similar power to the sequencing of > 2000 individuals at 30X depth. Additionally, Keel et al. showed that sequencing a large number of individuals at 4-6X provides higher power for rare variant detection than sequencing a smaller number of individuals at a great depth. The sequencing cost of the latter approach is much higher than that of the former. Therefore, considering the sequencing cost and convenience, a 4X depth is necessary to achieve more accurate genotyping of pigs since the number of false-positive variants increases dramatically when the depth is less than 4X. According to the related literature, 4X is also the depth used in the 1000 Genomes Project for the discovery of variants associated with complex diseases in humans, and a 10X or greater depth has been used for the assessment of genome-wide genetic variation [37, 38] in a pig population or the detection of selection signatures [39, 40]. In general, 4X is an appropriate depth for genome-wide association studies, and 10X is an appropriate depth for accurate genotyping and population genetic studies.
To evaluate variant quality for each downsampled genome, the criteria of the novel rate, Ti/Tv ratio, and genotype concordance to the greatest depth (21.75X) were investigated. The novel rate gave us a general idea of the accuracy of variant calling and genotyping. The Ti/Tv ratio is an important criterion for assessing the quality of SNP calling, and it is expected to be 2.1~2.2 for whole-genome variants. Furthermore, a higher Ti/Tv ratio usually indicates higher accuracy of SNP calling [31, 41, 42]. Our results regarding the Ti/Tv ratio were in agreement with the expected ratio. Compared with Duroc and Landrace, Yorkshire presented a higher Ti/Tv ratio. Low-coverage sequencing always introduces false-positive variants in NGS data analysis, but how low this coverage is remains unclear. Our results showed that the false-positive rate increased significantly when the sequencing depth was less than ~4X (Fig. 4c; Fig. 5): the novel rate was extremely high, and the Ti/Tv ratio and the concordance rate were low, indicating that resequencing at depths lower than 4X could provide inaccurate variants. Moreover, the novel rate increased sharply when the depth was less than 2.18X (Fig. 4c), which indicated that even more false-positive variants were discovered when the depth was less than ~2X. With further decreases in sequencing depth, 2X is the lower boundary to ensure the quality and coverage of sequencing. This conclusion is in agreement with the simulation study by Fumagalli showing that 2X is the minimum sequencing depth for obtaining accurate estimates of allele frequencies and identifying polymorphic sites. For the comparison of SNP calling across the three pig breeds, the coverage and concordance with the greatest depth as a function of depth showed no difference across breeds; however, the novel rate and Ti/Tv ratio as a function of depth differed between Yorkshire and the other two breeds (Duroc and Landrace). According to B. N. Keel et al., the average Ti/Tv ratios were 2.183, 2.206 and 2.243 for 12 Duroc, 12 Landrace and 48 Yorkshire-Landrace composite sows based on Illumina HiSeq 2500 technology and alignment to the Sscrofa10.2 reference genome, resulting in a mean of 6.1-fold coverage per genome. In addition, the Ti/Tv ratio varies greatly by genome region and function; the Ti/Tv ratio is generally approximately 3.0 for exome sequencing data and approximately 2.0 outside of exome regions. To our knowledge, it is most likely that biological factors lead to the differences in the Ti/Tv ratio across breeds.
Genotype imputation is widely used in whole-genome association studies and genomic prediction/selection, where a number of individuals are sequenced at a great depth as a reference panel, after which the imputation of individuals on the basis of SNP chip data or low-depth sequencing would be cost efficient. Sequencing data are becoming increasingly important for purposes such as association studies, genomic selection, etc., in which large samples are essential. Thus, to balance sequencing cost and efficiency, the sequencing strategy should be taken into account in practice. Two-stage sequencing has been suggested as a strategy in which some portion of a sample is first sequenced at high coverage as a reference panel, after which the larger sample is sequenced at low coverage; this has proved to be a powerful, effective and practical approach. Moreover, STITCH, which is a method that was developed for the imputation of genotypes based on sequencing data without the use of an additional reference panel or array data, achieves a high imputation accuracy for ultralow-coverage sequencing. The approach resulted in accuracy values of 0.948 and 0.922 for sequencing data for outbred mice (0.15X) and Han Chinese people (1.7X), respectively. Furthermore, GeneImp was developed for the imputation of ultralow-coverage sequencing data (<1X) with a reference panel and achieved an accuracy of 0.9. With the development of algorithms and software for low-coverage sequencing or even ultralow-coverage sequencing, additional applications of low-coverage sequencing may be developed, and our research can provide basic guidance for such applications.
In this study, we also compared single-sample calling and multisample calling algorithms. The single-sample calling algorithm is simple, making use solely of the reads collected at a single genome position for that sample. In contrast, the multisample calling algorithm includes the information from all samples at a single site. According to our results, multisample calling revealed more variants than single-sample calling, and the lower the depth of sequencing, the greater the difference was, with 13-fold and two-fold differences in the numbers of variants discovered via multisample calling compared to single-sample calling when sequencing was performed at 1X and 22X, respectively. Additionally, multisample calling produced more false-positive variants than single-sample calling when the depth was less than 10X. Similar results were found in Bizon’s research and Liu’s research conducted in a Native American population and another human dataset, respectively. Our results further confirmed the marginally lower nonreference discrepancy value observed for identified single-sample variants than variants obtained via the multisample method in sequence data from 65 cattle. Our results suggested that stricter quality control parameters should be implemented in multisample calling, especially when the depth is less than 10X.
In this study, we explored the relationship between sequencing depth and whole-genome coverage, discovery power, and the accuracy of SNP calling across three pig breeds, Duroc, Landrace and Yorkshire. The genotyping accuracy of the sequencing data was validated with PorcineSNP80K BeadChip data for Yorkshire pigs as well. In addition, multisample and single-sample strategies for SNP calling were compared. Our results showed that a depth of 10X was the point at which saturation was reached for the function of coverage, covering 99% of the whole pig genome, accounting for 84.42% of the variants obtained from the deepest genome coverage (21.75X in this study), and ensuring good quality of variants from the aspects of the novel rate, Ti/Tv ratio, and beadchip validation. Additionally, more false-positive variants were detected when the depth was less than 4X, suggesting that 4X is the lower boundary for reasonable sequencing quality. Compared to single-sample calling, multisample calling was more sensitive, especially at lower depths, but it also detected more false-positive variants; stricter quality control parameters should therefore be implemented in multisample calling.
XDD and QZ contributed to the design of the study. YFJ performed the majority of the analysis with contributions from SW. YJ performed the sampling and experiments. XDD and YFJ wrote the manuscript. All authors read and approved the final manuscript.
This work was funded by grants from the earmarked fund for China Agriculture Research Systems (CARS-35) and the National Natural Science Foundation of China (31671327), which covered the costs of library preparation and sequencing in this study. Sample collection was funded by the Major Project of Selection for New Livestock and Poultry Breeds of Zhejiang Province (2016C02054–5) and the Anhui Science and Technology Key Project (Grant No. 17030701008). The Program for Changjiang Scholars and Innovation Research Teams in Universities (IRT_15R621) covered the publication cost of this paper and provided support for YFJ and YJ.
Necessary permission was obtained from the owner of the farm for collecting the samples and for their use in the subsequent study. There are no conflicts of interest between the owner and the authors. All animal management and sample collection procedures strictly followed the protocol approved by the Institutional Animal Care and Use Ethics Committee (IACUC) at the China Agricultural University, which specifically approved this study (permit number DK996).
Consent for publication
The authors declare that they have no competing interests.
- 18. Xu C, Wu K, Zhang JG, Shen H, Deng HW. Low-, high-coverage, and two-stage DNA sequencing in the design of the genetic association study. Genet Epidemiol. 2016;41(3).
- 19. Fang H, Wu Y, Narzisi G, Orawe JA, Barrón LTJ, Rosenbaum J, Ronemus M, Iossifov I, Schatz MC, Lyon GJ. Reducing INDEL calling errors in whole genome and exome sequencing data. Genome Med. 2014;6(10):89.
- 25. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Oxford University Press; 2009.
- 33. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Computing. 2011;14:12–21.
- 34. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4(1):7.
- 40. Carl-Johan R, Hendrik-Jan M, Alvaro MB, Khurram M, Shumaila S, Doreen S, Chao W, Örjan C, Patric J, Jørgensen CB. Strong signatures of selection in the domestic pig genome. Proc Natl Acad Sci U S A. 2012;109(48):19529–19536.
- 41. Keel BN, Nonneman DJ, Rohrer GA. A survey of single nucleotide polymorphisms identified from whole-genome sequencing and their functional effect in the porcine genome. Anim Genet. 2017;48(4).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.