Background

Apart from single nucleotide polymorphism (SNP), copy number variation (CNV) is another type of genetic marker that potentially affects animal phenotype [1]. These CNV consist of variation of genome sequences ranging from 50 to 5 million base pairs [2]. Compared to SNP, CNV show more drastic effects on gene expression and function, such as altering gene dosage, disrupting coding sequence, or perturbing long-range gene regulation [3]. However, until recently, the effective use of CNV as genetic markers for association with diseases and economic phenotypes has been impaired by difficulties in accurately detecting CNV and their boundaries, which vary among individuals [4, 5]. To date, especially within the livestock genomics field, most studies have focused exclusively on CNV discovery, and only a few studies and software have attempted to use comparative genomics methods to detect possible phenotype-related CNV in livestock using genome-wide association study (GWAS) [5,6,7].

Numerous methods have been used to detect CNV, ranging from traditional cytogenetic approaches, such as karyotyping and fluorescence in situ hybridization, to genome-wide in silico CNV prediction [8, 9]. Substantial improvements have been made recently to the accuracy and throughput of CNV discovery. For example, the application of microarray and next-generation sequencing technologies makes it possible to compare CNV among populations at a whole-genome scale, and sequencing methods offer opportunities to identify CNV in highly complex repetitive regions of the genome [10, 11]. However, neither sequencing methods are economical or widely used for large population analysis, including GWAS. Thus, the use of SNP chips is now commonplace for SNP-based GWAS studies, including studies of large populations of livestock. As a result, software packages have been developed and published for CNV prediction from SNP chip data, such as the SNP & Variation Suite (SVS; Golden Helix, Bozeman, MT), which facilitate a low-cost approach to detect common CNV within a population for GWAS analysis [5, 6].

Because Holsteins are the largest milk-producing dairy breed worldwide, GWAS examining reproduction, growth, milk production, and disease resistance traits among dairy cattle have been performed often in the Holstein breed. In these studies, SNP markers were used almost exclusively, whereas studies using CNV markers for GWAS are rare. For instance, Xu et al. [6] characterized 34 CNV significantly associated with milk production traits using a CNV inferred from the BovineSNP50 array (Illumina, San Diego, CA), and Durán Aguilar et al. [12] applied CNV-based GWAS for milk somatic score using the BovineHD Genotyping BeadChip (Illumina) containing over 777 K SNP [6, 12]. In a third study of Spanish Holsteins, 90 segregating CNV were identified from BovineSNP50 array hybridization signals and studied for their association with 7 production and conformation traits [7]. For other cattle breeds and livestock species, such as beef cattle and pigs, several GWAS have been conducted using CNV [5, 13].

Here, we applied CNV detection and GWAS analysis for 10 important production-related phenotypes of dairy cattle, including those related to feed intake, milk composition, female fertility, and productive life, as well as net merit. These studies were conducted using genotypes generated with the BovineHD Genotyping BeadChip (Illumina) from 528 Holstein cows. We identified 191 CNV based on size and frequency, and found 57 CNV associated with at least one of those phenotypes.

Methods

Animals

Cows (n = 528) used in the study were registered Holsteins from the Beltsville Agricultural Research Center Dairy Herd located in Beltsville, MD. Genotyping was performed using high-quality genomic DNA extracted from white blood cells or hair follicles of cows using Gentra Puregene DNA extraction kits (Qiagen, Valencia, CA) and the BovineHD Genotyping BeadChip (Illumina).

Phenotypes and PTA calculations

Phenotypes were available for analysis from 473 of the genotyped cows. Measurements for estimation of DMI and RFI were performed as described previously [14]. Production traits of cows were collected by Dairy One Cooperative Inc. (Ithaca, NY) using ICAR-approved methods and quality certification standards administered by the Council on Dairy Cattle Breeding (Bowie, MD). Conventional PTA for RFI and DMI were calculated as described in Lu et al. [15], and PTA for milk fat percentage, milk protein percentage, cow conception rate, heifer conception rate, daughter pregnancy rate, somatic cell score, net merit, and productive life were calculated as described in VanRaden and Wiggans [16]. De-regressed PTA (dPTA) were calculated according to the formula: dPTA = PTA / reliability as described by Garrick et al. [17] and used as the phenotype for further CNV GWAS, similar to Xu et al. [6]. Although matrix deregression based on pedigree structure is expected to more accurately remove the contributions of other relatives to the final deregressed evaluation [18], we used the simpler removal procedure.

CNV segmentation and genotyping

The SNP genotypes generated by the Illumina BovineHD Genotyping BeadChip assay with a call rate > 90% were used to detect the common CNV shared among Holstein cows using methods previously described by Zhou et al. [5]. Specifically, the DSF file was exported from GenomeStudio Software and the Log R Ratios (LRR) were imported into SVS 8.3.0 ([19]; Golden Helix Inc.). A total of 735,293 SNP were successfully mapped onto the 29 autosomes of the Bos taurus genome assembly UMD 3.1 [20]. The default GC correction file in the SVS software was used to correct the LRR waviness caused by the GC content. The multivariate method was used to define the CNV segments, employing the SVS default settings which are based on the expected CNV density (count and length) within a genome, as follows: (1) a maximum of 20 segments per 10 K markers; (2) at least 3 markers per segment; and (3) a maximum pair-wise segment P-value of 0.005. The 3 states with a comparatively strict threshold (segment mean ± 0.4) defined the CNV as 3 types of events (loss [− 1], neutral [0], or gain [+ 1]) across all of the samples.

PCA-corrected association testing

Multiple linear regression in an additive genetic model was used to identify the CNV significantly associated with 10 traits individually. Both FDR and 10,000 permutations were performed to correct for multiple testing. The model was as follows:

$$ {y}_i=\sum \limits_{j=1}^m{x}_{ij}{\beta}_j+{e}_i $$

where y i is the dPTA of the ith individual, x ij is the CNV genotype of the ith individual (gain, loss, and neutral as represented by 1, − 1 and 0), β j is the CNV effect, m is the number of CNV, and e i is the residual. To correct for batch effects/stratification, the Principal Component Analysis (PCA) option was selected, wherein the SVS software identified the optimum number of components based on a range from 1 to 20. See Golden Helix SNP & Variation Suite v8.3.0 Manual [19] for additional details on methodology. Significant CNV were identified after false discovery rate (FDR) correction (P < 0.05) [6].

Evaluation of QTL and genetic overlap of CNV

Gene annotations were downloaded from the Ensembl database [21] and QTL were downloaded from the animal QTL database [22] (based on the UMD3.1 bovine reference genome assembly). The overlaps between CNV and genes or QTL were detected using R3.3.3 GenomicRanges [23]. To detect potential genes affected by significant CNV, we define the ‘overlap’ as more than 1 bp in common between the CNV region and the genomic region (including the 5-Kb flanking regions both up- and downstream) of a given gene transcript. Because of overly large confidence intervals (reported by the Animal QTLdb) for some QTL, we filtered out the QTL with confidence intervals > 30 Mb and used a strict threshold to define the overlap as at least 50% of the CNV length being covered by the QTL.

CNV validation by qPCR

For each CNV selected for validation, the Illumina Bovine HD SNP chip probe sequences corresponding to SNP within the CNV region of interest were identified. The 400-bp genomic DNA sequence surrounding each targeted SNP (i.e., ± 200 bp of the SNP) was selected as the template sequence for PCR primer design using the NCBI Primer-BLAST primer designing tool [24]. The desired PCR product size was specified as 100 to 300 bp and the best primer pair was selected from the output. Primer information can be found in Additional file 2: Table S1. Reactions (25 μL) were performed in triplicate using SsoAdvanced Universal SYBR Green Supermix (Bio-Rad, Hercules, CA), 5 ng genomic DNA, and 400 nM each primer on the Bio-Rad CFX96 Touch Real-Time PCR Detection System. Amplification consisted of 95 °C for 1 min, followed by 45 cycles of 95 °C for 10 s, 58.2 °C for 10 s, and 72 °C for 30s. Amplification of the expected product size was confirmed by electrophoresis on a 2% agarose gel. Melting curve analysis following the final PCR amplification step was used to confirm presence of a single amplicon. Efficiency of PCR amplification for each target was determined from a 4-point standard curve using 1.5 to 24 ng of genomic DNA as template, and ranged from 99.3 to 107.4%. Linearity of all standard curves exceeded 0.995. The BTF3 gene and a common DNA sample from Hereford cow, L1 Dominette 01449 were used as references for all qPCR experiments. The 2-ΔΔCT method [25] was employed to analyze qPCR results wherein we defined 0 to 1.5 copies as a copy loss (− 1), > 1.5 to 3 copies as neutral (0), and > 3 copies as gain (+ 1).

Phenotype correlation analysis

Ten diverse phenotypes of RFI, DMI, milk fat percentage, milk protein percentage, somatic cell score, cow conception rate, heifer conception rate, daughter pregnancy rate, net merit, and productive life for 473 Holstein cows were selected for correlation analysis to understand their relationships to one another. Pearson product-moment correlations were computed between pairs of dPTA for all of the 10 phenotypes.

Results and discussion

CNV segments and genotyping

We detected 454 CNV from 528 Holstein cows where the CNV were characterized as 3 types (− 1, 0, and + 1). We filtered out likely false CNV using the strict threshold criteria of length ≤ 1 Mb and frequency > 0.5%. The resulting 191 common CNV had lengths ranging from 727 to 897,251 bp and frequencies ranging from 1.1 to 98.9% (Additional file 2: Table S2; Fig. 1). These CNV comprised 0.64% (16.2 Mb) of the whole autosomal length and were distributed differently among autosomes. The chromosome with the highest percentage of CNV length (3.8%) was Chr15. Bovine Chr12 was reported as having the highest percentage of CNV length in a previous CNV study [11], whereas in the current study, Chr12 ranked second highest for percentage of CNV length among all autosomes. We detected 164 genes that overlapped with the CNV (Additional file 2: Table S2). A total of 47 or 189 of our 191 CNV events were overlapped with Ben Sassi et al. and Xu et al. (Additional file 2: Table S2) [6, 7]. Gene ontology (GO) analysis results of genes overlapping with CNV are typically enriched in functions related to immunity or response to stimuli [26]. However, in our DAVID analysis using Fisher’s exact test, the CNV were highly enriched in GO terms related to development and growth, such as: multicellular organismal process, developmental process, and multicellular organismal development (Additional file 2: Table S3).

Fig. 1
figure 1

Distribution of CNV and significantly associated CNV on each Bos taurus autosome. Outer circle: distribution of 191 CNV where the height and color of histograms represent the variant frequency of each CNV (red, > 0.6; yellow, 0.3 to 0.6; and black, < 0.3); Inner circle: distribution of CNV significantly associated with at least 1 of the 10 production traits evaluated

Phenotype correlation analysis

Figure 2 shows results of the Pearson correlation analysis between each of the dPTA. As described previously [27, 28], the 10 dPTA generally showed moderate to strong correlation coefficients within each phenotypic category (i.e., feed efficiency and intake [RFI and DMI], milk composition and quality [fat percentage, protein percentage, and somatic cell score], fertility [cow conception rate, heifer conception rate, and daughter pregnancy rate], and composite evaluation [net merit and productive life]), but weaker correlation coefficients between different phenotypic categories. For example, RFI versus DMI was moderate (r = 0.60, P < 0.0001), milk fat versus protein percentage was moderate to high (r = 0.75, P < 0.0001), and the correlation coefficients of the 3 fertility traits were variable, ranging from 0.27 to 0.89. Net merit and productive life were highly correlated (r = 0.78, P < 0.0001) with one another, and both were moderately correlated with the fertility phenotypes. Somatic cell score showed weak correlation with other phenotypes, such as productive life (r = − 0.33, P < 0.0001) and net merit (r = − 0.26, P < 0.0001), and most coefficients were negative.

Fig. 2
figure 2

Pair-wise Pearson correlation coefficients for 10 dairy production-related phenotypes of interest. DPR: daughter pregnancy rate; CCR: cow conception rate; HCR: heifer conception rate; NM: net merit; PL: productive life; RFI: residual feed intake; DMI: dry matter intake; SCS: somatic cell score; FatPct: fat percentage; and ProPct: protein percentage

Significant CNV association analysis

Before applying the GWAS analysis, we excluded 2 CNV because they appeared in fewer than 5 individuals. The frequency of the remaining 189 CNV among the 473 cows ranged from 1.3 to 98.9%. In total, we detected 57 CNV associated after FDR correction (P < 0.05) with at least 1 of the 10 phenotypes (Additional file 2: Table S4; Fig. 3). These significant CNV were distributed among 25 autosomal chromosomes (Fig. 1), wherein no significant CNV were identified on Chr13, Chr20, Chr24, or Chr28. We did not observe strong bias due to the frequency of CNV present on a given chromosome on the number of significant CNV detected within that chromosome. Chr1 and Chr12 had the greatest CNV number for GWAS analysis, but based on CNV density (i.e., CNV count normalized by chromosome length), we did not find obvious enrichment of significant CNV on these 2 chromosomes. However, Chr7 was enriched with the greatest number (n = 6) of CNV significantly associated with the production traits of interest. The length of the 57 significant CNV ranged from 1.2 to 350 Kb and their frequencies ranged from 4.6 to 98.9%. We selected 7 significant CNV associated with multiple traits of interest or overlapping known genes or QTL for validation among 8 animals using qPCR. As most of the CNV (90%) were deletions, all 7 selected CNV were deletions. For 3 out of 7 cases, the PCR amplicon for CNV validation was within genes: CNV33 (RHOH), CNV46 (GRIK4), and CNV40 (AP2A2). The result showed 55.4% of the 56 qPCR results were consistent with the in silico prediction (Additional file 2: Table S5), which is typically in the range of 60 to 70% concordance [29]. Inconsistencies between qPCR and in silico prediction may occur due to multiple reasons, such as complex sequence of the genome, artificial assembly, probe bias, primer design, DNA quality, or other factors during array hybridization or PCR amplification.

Fig. 3
figure 3

Manhattan plots of the genome-wide association study results for 10 dairy production traits of interest. Negative log10-transformed P-values from a genome-wide scan (y-axis) are plotted against genomic coordinates on 29 Bos taurus autosomal chromosomes (x-axis). The solid horizontal line in each plot represents the threshold for significance based on a P-value < 0.05 after FDR correction. DPR: daughter pregnancy rate; CCR: cow conception rate; HCR: heifer conception rate; NM: net merit; PL: productive life; RFI: residual feed intake; DMI: dry matter intake; SCS: somatic cell score; FatPct: fat percentage; and ProPct: protein percentage

As expected, the significant CNV were associated with phenotypes of different categories, with the number of significant CNV associated with a given trait ranging from 1 (e.g., dry matter intake) to 19 (e.g., productive life; Additional file 2: Table S4). Phenotypes with higher correlation are more likely to share significant CNV. For example, we found 6 significant CNV (CNV11, CNV23, CNV29, CNV50, CNV121, CNV137) associated with both cow conception rate and daughter pregnancy rate; traits which were highly correlated. Similarly, another highly correlated phenotypic pair of net merit and productive life shared 10 significant CNV (CNV20, CNV40, CNV43, CNV45, CNV55, CNV62, CNV80, CNV94, CNV118, CNV121), whereas RFI and DMI shared only 1 significant CNV (CNV1). The most dissimilar phenotype in the correlation analysis, somatic cell score, was found to share no significant CNV with any of the other phenotypes evaluated. The relationships between phenotypes will help to better define potentially related markers as significant CNV appearing with both or all highly correlated phenotypes will be more reliable than those CNV associated with only 1 phenotype.

Genes and QTL overlapping with significant CNV

We detected 54 Ensembl gene ID overlapping with 20 significant CNV (Additional file 2: Table S4). Among them, CNV150 (Chr26: 25,719,640–26,013,587) overlapped with the greatest number of genes (n = 13) with all genes located inside the CNV region. There were 11 significant CNV (CNV147, CNV32, CNV33, CNV18, CNV66, CNV146, CNV151, CNV177, CNV34, CNV150, and CNV40) that overlapped with genes previously reported in cattle, or other species, to harbor deletions or duplications, and several others overlapping known gene regions. For instance, ELF3 was identified as a candidate gene for somatic cell score in a previous CNV-based GWAS study [12]. In the present study, CNV151 was associated with cow conception rate and is overlapping with the ELF3 gene. The product of the ELF3 gene is a transcription factor that functions in inflammation and epithelial cell differentiation, and may be involved in mammary gland development and involution [30], supporting its potential link to somatic cell score. How its function may relate to cow conception rate is not known.

Over half of the significant CNV were overlapping with at least 1 QTL (Additional file 2: Table S4), of which some were for traits that support our GWAS results. For example, CNV46 was significantly associated with heifer conception rate and it overlaps with a QTL for the circulating level of the reproductive hormone, LH. In addition, CNV23 was significantly associated with cow conception rate and daughter pregnancy rate, and it overlaps with a QTL for first-service conception rate. Lastly, CNV66 was significantly associated with RFI and overlaps with a QTL for average daily weight gain, and CNV7 was associated with somatic cell score and overlaps with multiple QTL for somatic cell score and clinical mastitis. Thus, further evaluation of these CNV is warranted in additional cattle populations for their association with these, or similar, production traits.

Characterization of shared significant CNV for phenotypes related to feed efficiency and intake

We focused our study on the 2 phenotypes of RFI and DMI related to feed efficiency and intake. There were 10 and 1 CNV significantly associated with RFI and DMI, respectively (Table 1). The CNV shared by RFI and DMI was CNV1 (Chr4: 108,225,979–108,252,635; P ≤ 0.001), which overlaps predicted olfactory receptor gene OR2A2 (LOC787786). However, only 6 cows were typed for CNV1 as neutral (0), while the remaining 467 cows were all typed as loss (− 1). Although the 6 neutral cows had a greater mean dPTA for both RFI and DMI, we identified some cows possessing the CNV1 loss type that had equal or even higher dPTA than the mean dPTA of the CNV1 neutral types (Additional file 1: Figure S1). Thus, particular attention must be given to significant CNV occurring at very low or high frequency when considering them as candidate markers for use in animal breeding. 

Table 1 Significant CNV from genome-wide association analysis for RFI and DMI traits among Holstein cattle

The CNV32 (Chr3: 14,876,353–14,882,132) associated with RFI also only showed loss and neutral types in our population, with a loss frequency of 81.0%. It is located 3.7 Kb upstream of the relaxin/insulin like family peptide receptor 4 (RXFP4; Chr3:14,871,054–14,872,623;strand: -) gene transcription start site, which is potentially affecting RXFP4 gene expression by altering its promoter region. Of interest, the ligand for RXFP4 is insulin-like peptide 5 (Insl5), which is produced by intestinal L cells in response to a reduction in feed intake, stimulating appetite [31]. In humans, RXFP4 was found to be significantly associated with obesity and body mass index [32]. Studies also reported that Insl5-RXFP4 signaling plays a role in glucose metabolism [33]. Thus, CNV32 may affect feed efficiency of cattle through an RXFP4-mediated pathway and provides an interesting candidate gene for further study in cattle.

Another significant CNV associated with RFI, CNV66 (Chr7:42,745,346–42,788,788), showed both loss and gain types with a total frequency of 54.6% within our Holstein population. Similar to CNV1, 2 olfactory receptor genes of predicted OR2T12 (LOC787816) and OR2AK2 are located inside CNV66. Olfactory receptors may influence feeding behavior, such as food preference and feed intake [34]. The variation of olfactory receptor gene copy numbers, theoretically, could affect their expression levels and impact RFI and DMI of Holstein cattle. These olfactory receptor genes serve as additional candidates for further study in the regulation of feed efficiency and feed intake of dairy cattle.

Conclusions

Our previous study showed that ~ 25% of CNV did not have a significant association with SNP; thus the effects of these CNV probably were not captured by tag SNP [6]. In this study, we performed a CNV-based GWAS for 10 important production traits and detected 57 CNV significantly associated with at least one of these production traits. Of particular interest regarding feed intake-related phenotypes, we detected 2 CNV associated with RFI located within RXFP4, encoding a G-protein coupled receptor thought to play a role in regulation of appetite and metabolism [31, 35], and 2 olfactory receptor gene regions, respectively. We also identified a single CNV within predicted OR2A2 strongly associated with both RFI and DMI. Our results identify CNV for genomic evaluation in Holstein cattle, and provide candidate genes contributing to variation in feed efficiency and feed intake-related traits.