Background

Teat number is an important reproductive trait that strongly affects the lactation ability of sows and may therefore directly affect the weight gain of piglets [1, 2]. However, the complex genetic architecture of teat number has left its molecular mechanism unclear. With the rapid development and application of high-throughput sequencing technology, genome-wide association studies (GWAS) combined with molecular marker technology are considered a powerful approach for dissecting the genetic architecture of complex traits in livestock [3,4,5]. Several previous GWAS indicate that VRTN [6] and ABCD4 [7] are candidate genes that may regulate teat development, but traditional GWAS based on single nucleotide polymorphisms (SNPs) have accounted for only part of the total heritability [8].

Some of the missing heritability has been attributed to copy number variation (CNV) in humans [9]. CNVs are structural variations of DNA segments, ranging from 50 bp to several Mb relative to a reference genome, and are widely distributed in the genome [10]. Overlapping CNVs are merged into larger regions known as copy number variation regions (CNVRs) [11]. CNVs may lead to phenotypic variation and disease by altering gene structure and gene regulation and by exposing recessive deleterious alleles [12]. After researchers revealed the existence of large-scale copy number variation in the human genome [13, 14], CNV research gradually expanded into various fields [15,16,17]. For example, Chen et al. [18] identified a CNV affecting the MSRB3 gene that increases pig ear size through a mechanism involving miR-584-5p; Wang et al. [19] found that the gain status at a CNVR decreased total number born and number born alive in Large White sows. Overall, research into novel CNVs in pigs can capture part of the heritability missed by SNP-based GWAS and explain more genomic structural variation.

In this study, multi-dataset GWAS were conducted for teat number in a French Yorkshire pig population. Our research aimed to identify genetic variants and candidate genes associated with teat number in pigs and to elucidate the potential molecular genetic mechanisms. Additionally, the genome-wide CNV detection provides a valuable complement to the CNV map of the French Yorkshire pig genome.

Results

Phenotype and heritability statistics

In this study, we analyzed three traits, total teat number (TTN), left teat number (LTN), and right teat number (RTN), in 644 French Yorkshire pigs (Table 1). The means (± standard deviation) of TTN, LTN, and RTN were 14.10 ± 0.92, 7.04 ± 0.50 and 7.06 ± 0.49, respectively. The coefficients of variation (C.V.) of all three traits were over 6.50%. We also estimated the SNP-based heritability (\(h^{2}\)) of the three traits using 53,869 SNPs; the genomic \(h^{2}\) ranged from 0.14 ± 0.06 to 0.17 ± 0.07, suggesting that teat number is a low-heritability trait.
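As an illustrative check, not part of the original analysis, the coefficients of variation can be approximated from the rounded means and standard deviations quoted above, assuming the usual definition C.V. = SD / mean. The function name is hypothetical, and values computed from rounded summaries may differ slightly from those computed on the raw records.

```python
# Means and standard deviations as reported in the text (rounded values).
reported = {"TTN": (14.10, 0.92), "LTN": (7.04, 0.50), "RTN": (7.06, 0.49)}


def coefficient_of_variation(mean: float, sd: float) -> float:
    """Return the coefficient of variation as a percentage (assumed: SD / mean)."""
    return 100.0 * sd / mean


cv = {trait: coefficient_of_variation(m, s) for trait, (m, s) in reported.items()}
for trait, value in cv.items():
    print(f"{trait}: C.V. = {value:.2f}%")
```

Under this definition, each of the three approximate values exceeds 6.50%, in line with the statement above.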

Table 1 The statistics for the phenotypes of teat number

Detection of genome-wide CNVs

A total of 8,746 CNVs (583 losses and 8,163 gains) were detected using PennCNV software v1.0.5 [20], with an average of 13.48 per individual and a range of two to 32 (Fig. 1a). CNV gains occurred more frequently than losses in individuals (Fig. 1a). The length of CNVs ranged from 10.4 kb to 1.25 Mb with an average of 128.80 kb, and the median length of gains (99.85 kb) was longer than that of losses (82.63 kb, Fig. 1b). As shown in Table S1, all CNVs were merged into 429 CNVRs, comprising 103 losses, 296 gains, and 30 mixed events (gains and losses occurring in the same region). The total length of the CNVRs was 66.78 Mb, occupying 2.95% of the pig autosomal genome (Sus scrofa version 11.1). The length of CNVRs ranged from 10.40 kb to 1.25 Mb with an average of 155.66 kb, and the median length of gains (88.21 kb) was shorter than that of losses (120.86 kb) and mixed events (301.20 kb, Fig. 1c). In addition, the majority of CNVs and CNVRs were under 500 kb in size.

Fig. 1 Number and length of different CNV and CNVR types. a Number of CNVs of each type in each individual. b Length distribution for each CNV type. The gold line indicates the median length of each CNV type. c Length distribution for each CNVR type. The gold line indicates the median length of each CNVR type

Figure 2a and Table 2 show the number and proportion of CNVRs on each autosome. The number of CNVRs ranged from 11 on chromosome 18 (SSC18) to 36 on SSC2, accounting for 2.56% and 8.39% of all CNVRs, respectively. CNVRs were densest on SSC12, with an average distance of 2.12 Mb between adjacent CNVRs. We also found that the density distributions of SNPs and CNVRs were remarkably consistent, suggesting that higher SNP density improves the detection rate of CNVRs [20] (Fig. 2a).

Fig. 2 Comprehensive representation of CNVR maps, frequencies, and known/novel CNVR counts for French Yorkshire pigs across the 18 autosomes. a Outer to inner circles: chromosome name; genomic location (in Mb); arrangement of CNVRs on the genome (gain in red, loss in green, mixed in blue); density histogram of CNVRs in 5 Mb bins (purple); density heatmap of CNVRs in 5 Mb bins; density histogram of SNPs in 5 Mb bins (blue); and density heatmap of SNPs in 5 Mb bins (blue). b Frequencies of CNVRs in the French Yorkshire population. c Known and novel CNVR counts categorized by frequency

Table 2 Chromosome distribution of all 429 CNVRs in the pig autosomes

Gene content of CNVRs

A total of 1,558 genes from the Ensembl annotation of the Sscrofa 11.1 genome overlapped with the 429 detected CNVRs, including 870 known genes and 688 unknown genes. Of these, 74.33% were protein-coding genes and 16.11% were long noncoding RNAs (lncRNAs); the remainder were pseudogenes, small nuclear RNAs (snRNAs), microRNAs (miRNAs), small nucleolar RNAs (snoRNAs), processed pseudogenes, miscellaneous RNAs (miscRNAs), small Cajal body-specific RNAs (scaRNAs), and T cell receptor (TR) V genes.

To further study the 1,558 genes contained in the CNVRs, we performed GO and KEGG pathway analyses (Table S2). The GO analysis showed that genes annotated with the terms proteolysis, calcium ion binding, and endoplasmic reticulum were predominantly represented in the CNVRs (Fig. S1a), and the KEGG analysis revealed that these genes were mainly represented in the endocytosis and estrogen signaling pathways (Fig. S1b). Compared with reported quantitative trait loci (QTLs) in pigs, a total of 419 (97.67%) CNVRs were included in or partially overlapped with 17,674 QTLs (Table S3), which are associated with a variety of traits, such as teat number, average daily gain, and body weight. Among these QTLs, 408 were associated with teat number.

Comparison of CNVRs detected in previous studies

The CNVR dataset detected in this study was compared with those of previous CNVR studies [19, 21,22,23,24,25,26,27,28,29,30,31,32], as shown in Table S4. The results of Zheng et al. [31] had the highest overlap, with 609 CNVRs, while those of Wang et al. [22] had the lowest, with only three CNVRs, indicating that a large number of CNVs in the pig genome have not yet been discovered. In total, 189 CNVRs were newly identified, and 240 CNVRs overlapped with previously reported CNVRs.

CNVR frequency in French Yorkshire population

The frequencies of CNVRs in the French Yorkshire population were also calculated and grouped into four categories: singleton (present in one individual), rare (present in more than one individual but with a frequency ≤ 0.01), low (0.01 < frequency ≤ 0.05), and common (frequency > 0.05), as shown in Fig. 2b. Singletons (151) accounted for 35.2% of all CNVRs, while there were 148 rare (34.5%), 78 low (18.2%) and 52 common (12.1%) CNVRs. Of the 189 novel CNVRs, 42.9%, 35.4%, 16.4% and 5.3% were singleton, rare, low and common, respectively. The proportion of novel CNVRs was larger in the lower frequency categories (Fig. 2c). CNVR frequencies ranged from 0.15% (detected in one pig) to 71.76% (detected in 465 pigs) and were concentrated in the singleton and rare categories, indicating that most CNVRs occur in only a few individuals and are hard to measure reliably [33]. Thus, we used the 130 CNVRs with a frequency greater than 1% for the subsequent GWAS.
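The four frequency categories defined above can be expressed as a short classification rule. The following Python sketch is illustrative only: the function names and the sample size of 1,000 are hypothetical and are not taken from the study.

```python
def frequency_category(n_carriers: int, n_samples: int) -> str:
    """Classify a CNVR by how many individuals carry it."""
    if not 1 <= n_carriers <= n_samples:
        raise ValueError("n_carriers must be between 1 and n_samples")
    if n_carriers == 1:
        return "singleton"
    freq = n_carriers / n_samples
    if freq <= 0.01:
        return "rare"
    if freq <= 0.05:
        return "low"
    return "common"


def retained_for_gwas(n_carriers: int, n_samples: int) -> bool:
    """Keep only CNVRs with frequency above 1% (the low and common classes)."""
    return frequency_category(n_carriers, n_samples) in {"low", "common"}


# Hypothetical population of 1,000 animals, for illustration only.
N = 1000
for carriers in (1, 10, 11, 50, 51):
    print(carriers, frequency_category(carriers, N), retained_for_gwas(carriers, N))
```

Under these definitions, the CNVRs retained for GWAS are exactly those in the low and common classes, which matches the reported 78 + 52 = 130 CNVRs.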

SNP-based GWAS results

PCA based on SNPs showed no population stratification in this population (Fig. S2). Quantile–quantile (Q–Q) plots were used to illustrate the level of potential P value inflation (Fig. S3). The genomic inflation factors (λ) of the GWAS ranged from 0.976 to 0.992, indicating no obvious evidence of population stratification. Significant SNPs detected by the 80K chip GWAS are presented in Fig. 3 and Table 3. Two SNPs (WU_10.2_5_76130558 and WU_10.2_5_76207514) located on SSC5 were associated with all three traits (TTN, LTN, and RTN). The lead SNP (WU_10.2_5_76130558) explained 3.33%, 2.69% and 2.67% of the phenotypic variance for TTN, LTN and RTN, respectively. A subsequent haplotype block analysis showed that these two SNPs are in complete LD and located within a 46 kb haplotype block (SSC5: 73.19–73.24 Mb), which suggests that variants near this potential QTL may have an important effect on teat number (Fig. S4).
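For readers unfamiliar with λ, the sketch below shows one common way to compute a genomic inflation factor from a list of GWAS P values. It assumes the conventional definition (median observed 1-df chi-square statistic divided by its expected median under the null); the text does not state which software or definition was used here, so this should be read as an illustration rather than the authors' procedure. The function name is hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation_factor(p_values):
    """Median 1-df chi-square statistic divided by its expected median under the null.

    Each two-sided P value p (0 < p <= 1) is converted to a chi-square
    statistic z**2, where z is the standard normal quantile at p / 2.
    """
    std_normal = NormalDist()
    chi_sq = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    expected_median = std_normal.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi_sq) / expected_median


# Evenly spaced P values mimic a null (uniform) distribution, so lambda is 1.
null_p = [(i + 0.5) / 1001 for i in range(1001)]
print(round(genomic_inflation_factor(null_p), 3))

# Systematically smaller P values give lambda above 1 (inflation).
inflated_p = [p / 2.0 for p in null_p]
print(genomic_inflation_factor(inflated_p) > 1.0)
```

Values close to 1, such as the 0.976 to 0.992 reported above, indicate that the test statistics are not systematically inflated.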

Fig. 3 Manhattan plots of the 80K chip GWAS in this population for TTN (a), LTN (b) and RTN (c). The x-axis represents the chromosomes, and the y-axis represents the -log10 (P-value). The solid and dashed lines indicate the 5% genome-wide (P = 9.94E-07) and suggestive (P = 1.99E-05) Bonferroni-corrected thresholds, respectively

Table 3 Significant SNPs associated with teat number in this population

Furthermore, the GWAS using imputed data revealed 30 significant variants for TTN (Fig. S5 and Table S5). The lead variant was 5_73264327_C on SSC5 (P = 4.54E-07). All 30 variants are located between 73.22 Mb and 73.30 Mb on SSC5, suggesting that this region harbors variants affecting TTN. Notably, both WU_10.2_5_76130558 and WU_10.2_5_76207514, identified by the 80K chip GWAS, also fall within this region.

CNVR-based GWAS results

PCA based on CNVRs likewise showed no population stratification (Fig. S6). To further dissect the genetic basis of teat number, CNVR-based GWAS were performed on 644 pigs with phenotypic records for TTN, LTN and RTN. Figure 4 shows the Manhattan plots for TTN, LTN, and RTN obtained from the association analysis. As shown in Table 4, we identified two CNVRs, located on SSC1 and SSC15, that were associated with both TTN and RTN, indicating effects on more than one trait. We also identified one CNVR associated with LTN, located on SSC9.

Fig. 4 Manhattan plots of the CNVR-based GWAS in this population for TTN (a), LTN (b) and RTN (c). The x-axis represents the chromosomes, and the y-axis represents the -log10 (P-value). The solid and dashed lines indicate the 5% genome-wide (P = 3.85E-04) and suggestive (P = 7.69E-03) Bonferroni-corrected thresholds, respectively

Table 4 Significant CNVRs associated with teat number in French Yorkshire pigs

Functional analysis of candidate genes

A total of 12 candidate protein-coding genes, comprising genes overlapping significant CNVRs and genes located within a 1 Mb region surrounding the significant variants, were detected based on Sus scrofa 11.1. Subsequently, we used the GeneCards and Mouse Genome Informatics databases and conducted an extensive literature review to explore the functional roles of the identified genes. As a result, we identified TRIM66 and PRICKLE1 as genes showing promising associations with teat number based on their known functions and previous research findings.

Discussion

Comparison of CNVR detected in this study with previous studies

In this study, a total of 429 CNVRs were detected in 649 French Yorkshire pigs using the GeneSeek Porcine 80K SNP chip, providing a valuable supplement to the pig CNV map. Gain CNVRs were more numerous than loss CNVRs, which may be related to the genome being more tolerant of duplication than of deletion [34]. In addition, consistent with our previous study, we found that CNVRs occur more frequently in telomeric regions (seven of the 10 largest CNVRs) [35]; telomeres ensure the stability and integrity of the genome and are associated with the replication of genetic material [36].

In previous studies, Wang et al. [27] performed CNVR detection in 12 pigs from nine breeds using a 1M aCGH array and obtained 758 CNVRs (Sus Scrofa 10.2), 20 of which overlapped with our results. Xie et al. [26] detected 172 CNVRs (Sus Scrofa 10.2) in 125 pigs using the pig 60K SNP chip, with only six overlapping with ours. These differences may stem from differences in platforms, detection software, algorithms, breeds, and sample sizes [37,38,39,40].

We also compared the sizes of CNVRs detected in different studies. The average length of CNVRs identified in the current study is 128.80 kb. In contrast, previous studies using pig SNP chips reported CNVR sizes ranging from 148.99 kb to 1835.44 kb, while those using next-generation sequencing data reported sizes from 4.16 kb to 7.04 kb. The uneven distribution of SNPs on Illumina high-density SNP genotyping arrays causes some small CNVs to be missed [20, 41]. We therefore conclude that increasing marker density could improve the efficiency and accuracy of CNV detection, especially for small CNVs.

Candidate genes associated with teat number

In this study, the SNP-based GWAS and CNVR-based GWAS did not detect overlapping signals. SNPs are single-nucleotide changes, whereas CNVs are duplications or deletions of large DNA segments that can encompass multiple genes or non-coding regions. Therefore, SNPs and CNVRs may regulate phenotypes by influencing different biological processes or pathways.

Despite this, through multi-dataset GWASs, we identified two candidate genes associated with teat number, namely TRIM66 and PRICKLE1. These two genes have known associations with various processes that could potentially influence teat number, such as breast cancer and abnormal vertebral development.

TRIM66 (Tripartite Motif Containing 66) is a protein-coding gene and a member of the tripartite motif (TRIM) protein family. Ning et al. [42] integrated several datasets and software tools to comprehensively analyze the expression patterns of TRIM genes and found that TRIM66 is significantly downregulated in breast cancer. Zhang et al. [43] also discovered that knocking down TRIM66 inhibits the proliferation of breast cancer cells. Similar to our findings, other GWAS on teat number have also identified several candidate genes with functions related to breast cancer [5, 44]. From a biological perspective, mammary gland development is fundamental to teat formation. Genes that affect breast cancer may also influence normal mammary gland development and function, thereby indirectly affecting teat formation. Although our current results require extensive validation, they hold promise for enhancing our understanding of the regulatory mechanisms underlying teat number in pigs.

PRICKLE1 is located on SSC5 at 73.71–73.83 Mb, approximately 416.48 kb from the variant region associated with teat number in our results. PRICKLE1 encodes a nuclear receptor and is associated with abnormal vertebral development [45] and other processes. Previous research suggests that PRICKLE1 is involved in the Wnt signaling pathway [45], which is crucial for mammary gland and thymus development [46]. Many strong candidate genes for teat number identified in previous studies have also been shown to be related to vertebral development, such as VRTN and MKX [6, 47]. Previous studies indicated that VRTN may be an important gene affecting teat number [48,49,50], but Zhuang et al. [7] showed that VRTN is not the most significant gene affecting teat number and that genetic heterogeneity of its insertion may exist among populations. No VRTN signal was detected in our analysis, which may be due to breed differences, the small population size and a high degree of inbreeding, leading to a low minor allele frequency and removal of the relevant variants during quality control.

Heritability and asymmetry of teat number in pigs

In this study, the heritability estimates for TTN, LTN, and RTN were 0.17 ± 0.07, 0.14 ± 0.06, and 0.17 ± 0.07, respectively, with LTN having a lower heritability than TTN and RTN. Similar results were observed in previous studies. Wei et al. [51] investigated teat number traits in a large sample of pigs, defining not only TTN, LTN, and RTN, but also the maximum per side of teat number (TNMPS), teat number symmetry (TNSYM), and the difference between sides of teat number (TNUMD). They found that the heritabilities for TTN and RTN were moderate (0.142–0.146), whereas those for LTN, TNMPS, TNSYM, and TNUMD were lower (0.048–0.097). After accounting for epistatic effects, the heritability for RTN decreased (0.047), and those for LTN and TNMPS were moderate (0.107–0.126). Additionally, Li et al. [5] and Liu et al. [44] also found differing heritability estimates for left and right teat numbers, likely due to asymmetry. In our study, excluding individuals without recorded teat numbers, 68 samples exhibited asymmetry between the left and right teat numbers. Furthermore, some samples lacked records for one side of the teats, which may have contributed to the discrepancies in heritability estimates.

Conclusion

In summary, we identified 429 CNVRs in the French Yorkshire pig population, covering approximately 2.95% of the Sus Scrofa 11.1 autosomal genome. These findings complement the CNV map of the Yorkshire pig genome. Our GWAS results revealed 32 variants and three CNVRs significantly associated with teat number. Two candidate genes, TRIM66 and PRICKLE1, were related to teat number. Combining GWAS across multiple types of genetic variants is a valuable approach for analyzing the genetic mechanisms underlying pig breeding traits and contributes to genetic improvement in pig breeding.

Materials and methods

Ethics statement

The animals and experimental procedures used in this study followed the guidelines of the Animal Care and Use Committee of the South China Agricultural University (SCAU) (Guangzhou, China). The ethics committee of SCAU approved all animal experiments (SCAU#2014–0136). The experimental animals were not anesthetized or euthanized in this study. We confirm that all methods are reported in accordance with the ARRIVE guidelines (https://arriveguidelines.org) for the reporting of animal experiments.

Animals and phenotype

In this study, a total of 659 French Yorkshire sows were raised in four nucleus pig breeding farms of the Wens Foodstuff Group Co., Ltd. (Guangdong, China) between 2012 and 2016: Yuhe Farm 1 (YH1), Yuhe Farm 2 (YH2), Qingyuan Farm 2 (QY2), and Baizi Farm (BZ). All pigs were kept under normal management conditions. The left teat number (LTN) and right teat number (RTN) were counted separately after birth, and total teat number (TTN) was the sum of the teat numbers on both sides, in accordance with our previous study [7].

SNP genotyping and quality control

DNA of each pig was extracted from ear tissue following standard protocols. The quality of all 659 DNA samples was assessed by light absorption ratios (A260/280 and A260/230) and gel electrophoresis. All DNA samples were diluted to a concentration of 50 ng/μL. The samples were genotyped with the GeneSeek Porcine 80K SNP chip, which contains 68,528 SNPs uniformly spanning the pig genome. Genotype quality control was conducted with PLINK v1.9 software [52]. SNPs located on the sex chromosomes or without positional information were excluded, and a set of 62,078 SNPs from 659 high-quality genotyped samples (call rate ≥ 90%) was retained for CNV detection. Furthermore, to improve the accuracy of the GWAS results, variants with call rates < 90%, minor allele frequency < 5%, or P value < \(10^{-6}\) for the Hardy–Weinberg equilibrium test were also excluded, and individuals with call rates larger than 95% were retained. After quality control, we imputed the genotype data to whole-genome sequence level. We used the Swine Imputation (SWIM) Server tool [53] with default parameter settings to perform genotype imputation, bridging the target and reference genotype data. The reference haplotype panels were constructed from whole-genome sequencing data collected from 2,259 pigs representing 44 breeds. Genotype imputation accuracy was high, with an average concordance rate exceeding 97%, a non-reference concordance rate of 91%, and an \(r^{2}\) value of 0.89, supporting the reliability and robustness of the imputed data. We applied the same quality control criteria to the imputed data as to the 80K SNP chip data. After quality control, 50,294 SNPs (80K chip data) and 14,656,673 variants including SNPs and INDELs (SWIM imputed data) from 644 French Yorkshire sows were retained for subsequent analysis.
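To make the variant-level thresholds concrete, the following Python sketch applies the call-rate and minor allele frequency filters described above to toy genotype vectors. It is only an illustration of the thresholds: the actual filtering was done in PLINK, the Hardy–Weinberg test is deliberately omitted, and the function names, the 0/1/2 genotype coding, and the example data are all assumptions introduced here.

```python
from typing import Optional, Sequence

# Assumed coding: 0, 1 or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def call_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes: Sequence[Genotype]) -> float:
    """Frequency of the rarer allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_variant_qc(genotypes: Sequence[Genotype],
                      min_call_rate: float = 0.90,
                      min_maf: float = 0.05) -> bool:
    """Keep a variant only if call rate >= 90% and MAF >= 5% (HWE not checked)."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


# Hypothetical toy variants (20 samples each), for illustration only.
common_snp = [0] * 10 + [1] * 8 + [2] * 2      # MAF = 12/40 = 0.30
rare_snp = [0] * 19 + [1]                      # MAF = 1/40 = 0.025
sparse_snp = [None] * 3 + [0] * 10 + [1] * 7   # call rate = 17/20 = 0.85

for name, snp in [("common", common_snp), ("rare", rare_snp), ("sparse", sparse_snp)]:
    print(name, passes_variant_qc(snp))
```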

CNVRs detection and functional enrichment analysis

The PennCNV software v1.0.5 [20] was used to identify CNVs by incorporating the SNP signal data of log R ratio (LRR) and B allele frequency (BAF) for each individual. CNV calling and CNVR determination were carried out following our previous study [35]. In brief, poor-quality samples (n = 11) were filtered out of the raw CNV dataset using the following criteria: LRR > 0.3, BAF drift > 0.01, and GC wave factor of LRR > 0.05. CNVs with consecutive SNPs \(\le\) 3 and length \(\le\) 10 kb were then removed to obtain more reliable CNV calls. Afterward, both BEDTools software v2.26.0 [54] and CNVRuler software v1.3.3.2 [55] were used to merge CNVs with at least 1 bp overlap across all samples to determine the CNVRs [31]. Finally, 8,746 CNVs and 429 CNVRs were identified. KOBAS v3.0 [56] was used for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis of the genes located in the CNVRs. In the enrichment analysis, Fisher's exact test was used to retain GO terms and pathways with P value < 0.05. In addition, the CNVRs were mapped to pig QTLs from the Animal QTL database (https://www.animalgenome.org/cgi-bin/QTLdb/index) [57]. To ensure the accuracy and validity of the GWAS results, we filtered the CNVR dataset by removing CNVRs with frequencies smaller than 1%, and 130 CNVRs were retained for the GWAS.
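The merging rule described above (CNVs sharing at least 1 bp are combined into one CNVR, which is labelled gain, loss, or mixed) can be sketched as a simple interval-merging routine. The Python code below is a minimal illustration under stated assumptions, not the BEDTools or CNVRuler implementation: the function names, the tuple layout, the 1-based inclusive coordinates, and the example CNVs are all hypothetical.

```python
from collections import defaultdict


def _label(kinds):
    """A CNVR is 'mixed' if it contains both gains and losses."""
    return "mixed" if len(kinds) > 1 else next(iter(kinds))


def merge_cnvs_into_cnvrs(cnvs):
    """Merge CNVs given as (chrom, start, end, kind) that overlap by >= 1 bp.

    Coordinates are treated as 1-based and inclusive; kind is "gain" or "loss".
    Returns a list of (chrom, start, end, cnvr_type) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, kind in cnvs:
        by_chrom[chrom].append((start, end, kind))

    cnvrs = []
    for chrom in sorted(by_chrom):
        cur_start = cur_end = None
        kinds = set()
        for start, end, kind in sorted(by_chrom[chrom]):
            if cur_start is not None and start <= cur_end:
                # Shares at least one base with the region being built: extend it.
                cur_end = max(cur_end, end)
                kinds.add(kind)
            else:
                if cur_start is not None:
                    cnvrs.append((chrom, cur_start, cur_end, _label(kinds)))
                cur_start, cur_end, kinds = start, end, {kind}
        if cur_start is not None:
            cnvrs.append((chrom, cur_start, cur_end, _label(kinds)))
    return cnvrs


# Hypothetical CNV calls pooled across samples, for illustration only.
example_cnvs = [
    (1, 100, 200, "gain"),
    (1, 150, 300, "loss"),   # overlaps the first call, so the region is mixed
    (1, 301, 400, "gain"),   # starts 1 bp after 300, so it is not merged
    (2, 50, 80, "loss"),
    (2, 80, 120, "loss"),    # shares exactly 1 bp (position 80)
]
example_cnvrs = merge_cnvs_into_cnvrs(example_cnvs)
print(example_cnvrs)
```

Sorting by start position within each chromosome means each CNV only needs to be compared with the region currently being built, so the merge runs in a single pass after sorting.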

GWAS for teat number

To identify candidate variants associated with teat number, GWAS were performed with the CNVR dataset and the SNP dataset, separately.

Each GWAS was performed using a univariate linear mixed model implemented in the GEMMA software v0.98.1 [58]. Before the GWAS, the genomic relatedness matrix (GRM) and principal component analysis (PCA) were computed using the GEMMA and GCTA software v1.92.4beta [59] based on the SNP datasets. The statistical model was as follows:

$$y=W\alpha + X\beta + u + \varepsilon$$

where \(y\) is the vector of phenotypic values for all animals; \(W\) is the incidence matrix of covariates, including fixed effects of the top five eigenvectors of PCA; \(\alpha\) is the vector of corresponding coefficients, including the intercept; \(X\) is the vector of marker genotypes; \(\beta\) is the corresponding effect size of the marker; \(u\) is the vector of random effects, with \(u \sim MVN_{n}(0, \lambda \tau^{-1} K)\); \(\varepsilon\) is the vector of random residuals, with \(\varepsilon \sim MVN_{n}(0, \tau^{-1} I_{n})\); \(\lambda\) is the ratio between the two variance components; \(\tau^{-1}\) is the variance of the residual errors; \(K\) is the GRM; \(I_{n}\) is the n × n identity matrix; and \(MVN_{n}\) denotes the n-dimensional multivariate normal distribution. In the 80K chip GWAS and the CNVR-based GWAS, the significance cutoff was defined by the Bonferroni method; a stringent genome-wide threshold (significant) and a more lenient chromosome-wide threshold (suggestive) were set at P < 0.05/N and P < 1/N, respectively, where N is the number of variants or CNVRs tested in the analysis. Based on human GWAS results, we set the genome-wide and suggestive significance thresholds for the GWAS based on imputed data at 5.00E-8 and 1.00E-6, respectively [60, 61]. In addition, Haploview [62] was used for haplotype block analysis to detect linkage disequilibrium (LD) among SNPs, with the settings "Ignore pairwise comparisons of markers > 500 kb apart" and "Exclude individuals with > 50% missing genotypes".
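As a worked example of the Bonferroni rule above, the short Python sketch below computes the two thresholds for a given number of tests. The function name is hypothetical; the test counts are the 50,294 SNPs retained from the 80K chip and the 130 CNVRs retained for GWAS, as reported in this section.

```python
def bonferroni_thresholds(n_tests: int, alpha: float = 0.05):
    """Return (genome-wide, suggestive) thresholds: alpha / N and 1 / N."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests, 1.0 / n_tests


snp_sig, snp_sug = bonferroni_thresholds(50294)   # 80K chip SNPs after QC
cnvr_sig, cnvr_sug = bonferroni_thresholds(130)   # CNVRs with frequency > 1%

print(f"SNP GWAS:  significant P < {snp_sig:.2E}, suggestive P < {snp_sug:.2E}")
print(f"CNVR GWAS: significant P < {cnvr_sig:.2E}, suggestive P < {cnvr_sug:.2E}")
```

With these inputs the thresholds come out at 9.94E-07 and 1.99E-05 for the SNP-based GWAS and at 3.85E-04 and 7.69E-03 for the CNVR-based GWAS, matching the values given in the captions of Fig. 3 and Fig. 4.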

Candidate genes identification

In this study, the positions of SNPs were based on the Sus Scrofa 11.1 version of the pig reference genome. We conducted functional gene annotation to identify candidate genes using the Mouse Genome Informatics website (https://www.informatics.jax.org/), the GeneCards website (http://www.genecards.org/) and the Ensembl website (www.ensembl.org/biomart/martview).