Introduction

One of the key aims of livestock genomics research is to identify genetic variation underlying economically important traits such as reproductive performance, feed efficiency, disease resistance/susceptibility, and product quality. Until recently, association studies and genomic predictions have been performed using commercial single nucleotide polymorphism (SNP) arrays, which contain markers evenly spaced across the genome. Accuracy of genomic predictions can be improved by using more informative variants, including variants located near or within genes, predicted to affect gene function, or known to be causal. Genetic variants detected from whole-genome sequence have been used to successfully identify causal variants and to map complex traits in domestic cattle [1,2,3,4]. In particular, it was shown that loss-of-function (LOF) variants, those expected to disrupt the protein coded by a gene, in the homozygous state can compromise fertility in cattle by causing embryonic lethality [1]. Hence, a comprehensive list of LOF variants, as well as variants that alter the amino acid structure of a protein and those that regulate protein production, detected from whole-genome sequencing would be of considerable interest in swine genomic studies, particularly those targeting fertility and production traits.

The porcine variation currently reported in National Center for Biotechnology Information (NCBI) genetic variation database (dbSNP) and the European Variation Archive (EVA) represents several diverse pig breeds and wild boars from different regions of the world. Therefore it is likely that many of the variants in these databases do not segregate in commercial swine germplasm, a consequence of domestication and selection for lean meat production. To provide information on variants predicted to affect gene function in commercial swine germplasm, a study was conducted to identify single nucleotide polymorphisms (SNP) from whole-genome sequence of 240 members of an experimental swine herd at the U.S. Meat Animal Research Center (USMARC). These animals included all 24 of the founding boars (12 Duroc and 12 Landrace), 48 of the founding Yorkshire-Landrace composite sows, 109 composite animals from generations 4 through 9, 29 composite animals from generation 15, and 30 purebred industry boars (15 Landrace and 15 Yorkshire) used as sires in generations 10 through 15. Interim results from the 72 founding animals have been previously published [5]. The objective of this report is to present the updated data following sequencing of an additional 168 animals and the release of the new and improved swine reference genome build, Sscrofa 11.1.

Main text

Materials and methods

The DNA samples sequenced for this study were extracted from semen collected by commercial artificial insemination services and from blood and tail tissue archived under standard operating procedures for the USMARC tissue repository. The research did not involve experimentation on animals requiring Institutional Animal Care and Use Committee approval.

Library preparation and sequencing

Blood, semen, or tail tissue samples were obtained from 240 members of a USMARC composite population, 72 founding animals (generation 0), 109 animals from generations 4 through 9, 29 animals from generation 15, and 30 industry sires. DNA extraction, library preparation, and sequencing for the 72 founding animals has been previously described in [5]. For the 36 animals in generations 4 and 5, genomic DNA was extracted from tail tissue using standard DNA extraction protocols, sheared to 300–500 bp using a Covaris S220 ultrasonicator (Woburn, MA, USA), and libraries prepared using the TruSeq DNA sample prep kit, version 2 (Illumina, San Diego, CA, USA) were paired-end sequenced (100 bp read length) on an Illumina HiSeq 2500 (Illumina Inc., San Diego, CA, USA) at DNA LandMarks (St.-Jean-sur-Richelieu, QC, Canada). Genomic DNA for the 30 AI sires, the 73 animals from generations 6 through 9, and the 29 animals from generation 15 was extracted using a Wizard SV96 genomic DNA purification kit according to the manufacturer’s protocol (Promega Corp., Madison, WI, USA). Genomic DNA was sheared to 350 bp (generation 15 animals) or 550 bp (AI sires and generation 6 through 9 animals) using a Covaris S220 ultrasonicator (Woburn, MA, USA), and libraries prepared using the TruSeq DNA PCR-Free prep kit (Illumina, San Diego, CA, USA). Libraries were paired-end sequenced (150 bp read length) on an Illumina NextSeq 500 (Illumina, San Diego, CA, USA) at USMARC. Bases of the paired-end reads for all sequenced genomes were identified with the Illumina BaseCaller, and FASTQ files were produced for downstream analysis of the sequence data.

Sequence data processing

The Trimmomatic software (Version 0.35) [6] was used to trim Illumina adaptor sequences and low quality bases from the reads. The remaining reads were mapped to the Sscrofa 11.1 genome assembly (NCBI Accession AEMK00000000.2) using Burrows–Wheeler Alignment (BWA, Version 0.7.12) [7] with the default parameters. All output SAM files were converted to sorted BAM files using SortSam from Picard (Version 1.1; http://broadinstitute.github.io/picard/), and duplicates in the BAM files were marked by applying MarkDuplicates from Picard. Genomic coverage for each of the BAM files was computed using Samtools (Version 1.3) [8].

Variant calling and filtering

Best practices established for the Genome Analysis Toolkit (GATK, Version 3.7) [9] were used to identify SNP. Briefly, RealignerTargetCreator and IndelRealigner from GATK were applied for local realignment of indels. Base quality recalibration was then performed using BaseRecalibrator from GATK, where the recalibration report was formed using the default setting for covariates and the NCBI dbSNP database (Build 150) as the database for known sites.

Multi-sample variant calling and genotyping was performed with the GATK UnifiedGenotyper, taking input from each of the 240 BAM files.

The UnifiedGenotyper output provides several metrics that assess the quality of detected variation, including quality (QUAL), quality by depth (QD), RMS mapping quality (MQ), Fisher strand (FS), haplotype score (HaplotypeScore), mapping quality rank sum test (MQRankSum), and read position rank sum test (ReadPosRankSum). The QUAL metric is a phred-scaled probability of the SNP being homozygous for the reference, where higher values indicate higher confidence. QD is computed by dividing the variant confidence (QUAL) by the unfiltered depth of all non-reference samples. FS is a phred-scaled P-value using Fisher’s Exact Test to identify strand bias. Higher FS values indicate stronger strand bias, i.e. likely false positives. The HaplotypeScore is a measure of how well the data in a 10-base window around the variant can be explained by at most 2 haplotypes. MQRankSum is a Wilcoxon Rank Sum Test that tests the hypothesis that the reads carrying the variant allele have a consistently lower mapping quality than the reads with the reference allele, while ReadPosRankSum is a Mann–Whitney Rank Sum Test that tests the hypothesis that instead of being randomly distributed over the read, the variant allele is consistently found more often at the beginning or the end of a sequencing read. In order to reduce the false discovery rate, variants were hard filtered to meet the following criteria, suggested by GATK documentation: QUAL > 30.0, QD > 2.0, MQ > 40.0, FS < 60.0, HaplotypeScore < 13.0, MQRankSum > − 12.5, and ReadPosRankSum > − 8.0.

Classification of variants

Filtered variants were classified according to their expected effect on gene function using snpEff (Version 4.3) [10] and NCBI (Release 106) annotation of the Sscrofa 11.1 build. Variants detected in coding sequence were classified into one of four functional categories: (1) loss-of-function SNP, which are high impact variants expected to disrupt the protein coded by the gene; (2) non-synonymous SNP, which are moderate impact variants that alter the amino acid sequence of the protein coded by the gene, (3) regulatory SNP, which occur in non-coding RNA and untranslated regions (UTR) of protein-coding genes; and (4) silent SNP, which are synonymous SNP and other low impact variants that do not affect the amino acid sequence.

Function of genes containing variation

Functions of genes containing detected variants were determined using the PANTHER classification system (Version 13.0) [11]. Enrichment analysis of gene function was performed using PANTHER’s implementation of the binomial test of overrepresentation. Significance of gene ontology (GO) terms was assessed using the default Sus scrofa GO annotation as background for the enrichment analysis. Data were considered statistically significant at Bonferroni corrected P-value < 0.05.

Results and discussion

Genomic DNA from 240 pigs, from a composite population at USMARC, was sequenced on the Illumina HiSeq and NextSeq platforms, generating approximately 72 billion paired-end reads (Additional file 1: Table S1). Sequence reads covered each pig’s genome at a mean of 13.62 fold (×) coverage. Individual coverage per animal ranged from 0.97× to 31.13×; 24 animals were covered at less than 3×, and 44 were covered at more than 20×.

A total of 26,850,263 high confidence SNP were identified. Our variants comprised 36.9% of the porcine SNP in the NCBI dbSNP database (Build 150). A total of 5793,049 of our variants were novel, i.e. not present in dbSNP, and 45,411 of them overlapped with the 61,596 SNP assayed by the PorcineSNP60 v2 BeadChip (Illumina Inc., San Diego, CA). While approximately 94% of the variants from our previous report were also identified in this study, only 85% of variants from our previous study were identified as high confidence variants (passing quality filtering). One factor contributing to this discrepancy was the use of the improved reference genome, where many of the gaps and misassemblies present in the Sscrofa 10.2 genome build were resolved. Furthermore, the addition of high coverage sequence for 168 animals improved the allele frequency calculations used during multi-sample variant calling by GATK’s UnifiedGenotyper software, thereby producing quality scores that are much more refined than those from our previous study.

The number of SNP detected in our pigs was similar to previous reports [12, 13]. The variants reported in dbSNP represent several diverse pig breeds and wild boars from different regions of the world. Therefore the detection of less than half of dbSNP’s variants in our animals is likely because those variants do not segregate in our population, which represents commercial swine germplasm, as a consequence of domestication and selection for lean meat production.

Variation was detected in 23,272 of the 29,847 annotated porcine genes (Additional file 2: Table S2). Most variation was detected in intergenic and intronic regions (Fig. 1). Analysis of expected effect on gene function predicted that 1162 protein-coding genes may be affected by 1729 loss-of-function or other high impact variants, 12,686 genes contained 64,232 moderate impact variants (nonsynonymous and other SNP altering the coded amino acid sequence), and 15,284 genes contained 90,939 low impact variants (SNP not expected to alter the amino acid sequence). Additionally, 250,403 variants that may be regulatory were identified, located in untranslated regions (UTR) of 15,739 genes.

Fig. 1
figure 1

Overview of the 26,850,263 SNP identified. Here, variant annotations have been collapsed so that each variant has only a single annotation. The category “Other” includes variants downstream of genes (up to 5 Kb), variants located in splice sites, and variants present in a gene but not in any of its transcripts

Genes involved in hydrolase activity, ion binding, and nucleotide binding were among those over-represented in the set of genes containing high impact variants (Table 1). Large numbers of genes in the nonsynonymous, synonymous, and regulatory classes resulted in large numbers of significantly over-represented GO terms. The most significant molecular function GO terms that were over-represented in genes with nonsynonymous variants were related to protein and ion binding (Additional file 3: Table S3). Similarly, GO terms related to protein binding, ion binding, and nucleic acid binding were among the top over-represented terms in the set of genes with synonymous and regulatory variants (Additional file 4: Tables S4, Additional file 5: Table S5).

Table 1 Significantly over- and underrepresented gene ontology (GO) terms in protein-coding genes with loss-of-function variants

Conclusion

Utilizing functional SNP could significantly boost the reliability of genomic predictions. Loss-of-function variants and others that disrupt or alter proteins coded by a gene, as well as those that regulate protein production, likely have a greater effect on phenotype than other types of variation. In this work, we identified 26,850,263 SNP, of which 316,364 were predicted to be of high to moderate impact. The present results provide an updated resource for functional variation in commercial swine germplasm.

Limitations

This work is only the first step in identifying functional genetic markers that influence economically relevant traits in the US swine industry. Additional work, including imputation of sequence variants throughout our population and developing assays to directly genotype functional variants, is needed to discover the extent to which these variants affect specific traits of interest. Continued examination of the variants identified in this work is expected to lead to the development of genotyping panels that will allow swine producers and breeders to be able to make more rapid genetic progress by including them into their selection decisions.