Background

Female fertility is of great interest to the dairy industry because impaired reproductive ability can reduce the profitability of a dairy herd, particularly by increased expenses of additional inseminations, veterinary treatments and replacement cows [1, 2]. Selection to improve milk production traits in Holstein and Jersey cattle populations has led to a decline in fertility traits in the last few decades due to unfavourable genetic correlations between fertility and milk production [3]. Many countries have now included fertility in their national breeding goals [4, 5], however fertility related traits usually have low heritability estimates [3, 6, 7], and genetic improvement through traditional breeding programs is slow, although substantial genetic variation exists [8].

When heritability estimates are low for a trait, one could examine the non-additive part of genetic variation for opportunities to improve the trait of interest. Non-additive genetic variation is the result of allele by allele interactions and involves intra-locus (dominance) and inter-locus (epistasis) interactions. Pedigree based estimates of non-additive genetic variance for fertility related traits have been reported to be as large as or larger than additive variance [9]. Hoeschele [10] estimated additive and non-additive genetic variance for a number of cow fertility measures in US Holsteins and obtained broad sense heritabilities that were at least twice as large as narrow sense heritabilities, albeit with large standard errors. Similarly, Fuerst and Solkner [9] reported a higher proportion of the phenotypic variance explained by dominance and additive × additive epistatic effects than heritability estimated in the narrow sense for calving interval (CI). Druet et al. [11] observed similar values for additive and dominance variances in analyses of fertility traits for Austrian Simmental and Brown Swiss dairy cattle and Palucci et al. [12] estimated non-additive genetic effects of sizable magnitude for a number of fertility measures in Canadian Holstein heifers and cows and suggested including non-additive genetic effects in models for estimating genetic merit of animals.

The prediction of non-additive genetic effects is not a trivial task and requires complex statistical and computational methods [13]. In traditional genetic evaluation methods, pedigrees are usually not informative enough to accurately estimate non-additive genetic effects for each individual and in many cases these effects are confounded with other non-genetic effects such as common environment, or maternal effects that may lead to over-estimation [12, 14]. The use of genomic data instead of pedigree information has the potential to overcome these problems when both phenotypes and genotypes for individuals in a given population are known. Availability of genotypes coupled with phenotypes has led to a renewed interest in the estimation of non-additive genetic variance. Sun et al. [15] showed that dominance variance can account for up to 7 % of total phenotypic variance of yield traits in dairy cattle and including additive and dominance effects in the model fits data better than including only additive effects. Ertl et al. [16] obtained larger estimates of dominance variance for milk production and conformation traits in Fleckvieh cattle such that the ratio of dominance variance over total genetic variance ranged from 3.3 % to 50.5 % in their study.

Over the last decade, high throughput genotyping has provided a valuable source of information to study the relationships between phenotypes and genotypes in livestock breeding [17, 18]. Access to large SNP arrays at an affordable price has made genome-wide association studies (GWAS) a common practice. GWAS use linkage disequilibrium (LD) between DNA markers and QTL to identify variants associated with traits and it can be used to map QTL regions throughout the genome [19]. Genome-wide association studies can be used to estimate both the additive and non-additive effects of genetic markers, but most published GWAS for dairy cattle to date have focused on additive effects of genes while non-additive interactions are generally neglected, with a few exceptions (e.g. [15]). This might not be an appropriate assumption since the modes of biological actions are often more complicated than can be explained by simple additive models [20]. Known genotypes of individuals are more informative than pedigree based methods, especially for estimating dominance effects. The disadvantage comes from the large increase in dimensionality generated by including all potential epistatic interactions. Testing combinations of all possible allelic interactions would be ideal, however it is not always computationally feasible and the results might not be interpretable. An alternative approach to decrease the dimensionality is to perform a filtering step in which a set of variants or genes are selected and subsequently tested for epistatic effects [2123].

The objective of this study was to detect chromosomal regions with additive and non-additive genetic effects for calving interval and milk yield (MY) using a Holstein discovery population. We then attempted to validate these associations in an independent Jersey population of cows. The benefits and limitation of accounting for non-additive effects in genetic analyses are discussed, with examples from the present study.

Results

Additive marker effects

Manhattan plots of all additive SNP effects for MY and CI in study populations are presented in Figs 1 and 2. There were a large number of SNPs for MY that reached the 5 % genome-wide significance level after the Bonferroni correction in both Holstein (P < 1 × 10−7) and Jersey (P < 1 × 10−5) populations. However, for CI (Fig. 2) very few SNPs passed this threshold, therefore lower suggestive thresholds in Holstein (P < 0.0001) and Jersey (P < 0.01) cows were set to identify potential associations. Significant associations for MY were found on BTA 5, 14 and 20 whereas suggestive additive SNP effects associated to CI were observed on BTA 5, 9, 11, 18, 22, 27, 29 and X chromosome.

Fig. 1
figure 1

Distribution of additive SNP effects for milk yield. Manhattan plot of all additive SNP effects for milk yield in discovery and validation populations with chromosome number on horizontal axis and –log10(P-value) on vertical axis

Fig. 2
figure 2

Distribution of additive SNP effects for fertility. Manhattan plot of all additive SNP effects for calving interval in discovery and validation populations with chromosome number on horizontal axis and –log10(P-value) on vertical axis

In the association analysis of MY, 715 SNPs were significant (P < 1 × 10−7) in the Holstein analysis (Table 1). The number of SNPs which validated in the Jersey population at the probability threshold of P < 1 × 10−5 was 93 in the individual SNP validation but this increased to 413 in the segment validation (Table 1). Out of the 93 individually validated additive SNP effects on MY, 64 had the same direction in both discovery and validation populations. False discovery rates (FDR) for MY were calculated to be very close to zero in all cases. For CI, 136 SNPs were significant (P < 0.0001) in the Holstein set which corresponds to an estimated FDR of 30 %, while only 5 of these were found to be significant (P < 0.01) in the Jersey population used for validation, with an estimated FDR equal to 26 % (Table 1). All of the 5 validated effects were in the same direction in Holstein and Jersey cows. In segment SNP validation for CI, the number of significant SNPs (P < 0.01) was 73 with a FDR of 1 %.

Table 1 P-value thresholds, number of SNPs found to be additively significant and corresponding false discovery rates (FDR) for milk yield (MY) and calving interval (CI) in discovery and validation populations

Dominance marker effects

The Manhattan plots of all dominance SNP effects for MY and CI are shown in Figs. 3 and 4 respectively. No dominance effects were found to be significant for either MY or CI at the genome-wide threshold of P < 1 × 10−7 (5 % Bonferroni corrected). Therefore for both traits, SNP effects were tested with a suggestive less stringent threshold of P < 0.0001 in Holstein population and validated in Jersey cows at a threshold of P < 0.01.

Fig. 3
figure 3

Distribution of dominance SNP effects for milk yield. Manhattan plot of all dominance SNP effects for milk yield in discovery and validation populations with chromosome number on horizontal axis and –log10(P-value) on vertical axis

Fig. 4
figure 4

Distribution of dominance SNP effects for fertility. Manhattan plot of all dominance SNP effects for calving interval in discovery and validation populations with chromosome number on horizontal axis and –log10(P-value) on vertical axis

The magnitude of significant dominance effects were smaller than those for additive effects especially for MY. This was confirmed with large FDRs, such that in the discovery analyses of both traits, these values were calculated to be more than 100 % (Table 2). Forty SNPs were significant (P < 0.0001) in the Holstein discovery population for MY, but only 1 of these was validated (P < 0.01) in Jersey cows with different signs observed in the discovery and validation analyses and with a FDR of 39 % (Table 2). The number of validated SNPs increased to 21 using the segment validation approach (P < 0.01) with a FDR equal to 1 %. For CI, 36 SNPs were found to have significant (P < 0.0001) dominance associations in the Holstein discovery set (Table 2). Of these, 3 (1 with same direction) and 10 SNPs were validated respectively in individual (FDR = 11 %) and segment (FDR = 3 %) validations in the Jersey population when a threshold of P < 0.01 was applied. The validated SNPs with suggestive significant dominance effects on MY and CI were detected on 5 (BTA 2, 3, 5, 26 and 27) and 7 (BTA 1, 2, 3, 7, 23, 25 and 28) chromosomes, respectively.

Table 2 P-value thresholds, number of SNPs with significant dominance effects and corresponding false discovery rates (FDR) for milk yield (MY) and calving interval (CI) in discovery and validation populations

Epistasis interactions

There were 255,255 and 9,180 pairwise interaction effects included in the epistasis analyses for MY and CI respectively. The SNPs were selected where they had significant additive effects in the discovery population of Holsteins (Table 1). We only performed individual validation for epistatic analyses, such that if an interaction between two SNPs was significant in discovery population we checked for its significance also in validation set.

Similar to the additive analysis, a larger number of pairwise interactions were found to be statistically significant for milk yield compared with fertility (Table 3). However, since all of the SNPs that had validated interactions for MY were located at the beginning of BTA 14 and near diacylglycerol O-acyltransferase 1 (DGAT1) gene, we suspected that these interactions may have been due to the DGAT1 mutation effect [24]. Therefore, the epistatic model was extended to include the DGAT1 effect to see if any of the interactions remained significant. We did this by including an additional SNP effect in the model, this SNP was the highest peak in the DGAT1 region. The absence of significant interactions in this region after including the SNP in high LD with the DGAT1 effect in the model suggests that the identified significant pairwise interactions were picking up the DGAT1 effect by creating haplotype like combinations. That is, the linkage disequilibrium of SNP allele combinations with the DGAT1 mutation was higher than for the individual SNP.

Table 3 P-value thresholds, number of significant pairwise additive × additive interactions and calculated false discovery rates (FDR) for milk yield (MY) and calving interval (CI) in discovery and validation populations

Five additive × additive interactions were found that had significant (P < 0.0001) effects on CI in Holstein analysis with a FDR of 18 %, however none of them was validated (P < 0.01) in the Jersey cows, so we did not report them here.

Discussion

Additive and non-additive genetic variants influencing MY and CI were identified using a large GWAS applied to phenotypes and genotypes of females of two breeds of dairy cattle. We found several significant additive and dominance associations in a discovery population of Holstein cows which were confirmed in Jersey cows. Although many of the additive QTL effects identified and validated here overlapped with previously reported genomic regions in the literature, this study is novel in identifying a number of QTL regions associated with dominance effects on MY and CI.

We discovered more additive associations for MY than for CI, which was likely a consequence of the higher heritability of MY. Calving interval is the most widely used measure of female fertility in national genetic evaluations worldwide [3, 8], but it has a low heritability. Besides, CI suffers from censored records where data from cows unsuccessful to calve again are excluded from evaluations, or a maximum arbitrarily value is assigned which impairs the predictive ability of models not accounting for censoring [25]. These contribute to the low power in detecting underlying additive genetic variants. Other detailed measures of fertility as well as potential biomarkers linked with female reproduction [8, 26] which have higher heritabilities could provide a better insight of the genetic underlying female fertility in future.

High false discovery rates in identifying SNPs with dominance effects on both MY and CI in the discovery population of Holsteins (Table 2) indicates that the identified associations might only serve as a reference for future studies. Increasing the number of observations would improve power.

Locating QTL regions

Additive effects

Four regions were detected that had additive associations with MY, of which 2 were on BTA 5 and 1 each on BTA 14 and BTA 20 (Table 4). The longest (~4 Mbp) region was on chromosome 14, extended from 1.4 to 5.3 Mbp and comprised 96 % of the significant additive SNP associations. All of the individually validated SNPs for MY, except 1 on BTA 5, were also found within this region, where 65 of them associated with 31 genes (Additional file 1: Table S1). A cluster of genes with suggested effects on all milk production traits in dairy breeds have previously been identified in this region (e.g. [2729]). The most significant association in this interval in both Holstein and Jersey animals was SNP rs109421300 (~1.8 Mbp) located within an intron of the DGAT1 gene. This marker also had the largest additive effect on MY and explained highest proportion of phenotypic variance (5.694 %) by additive effects for MY. The effect of DGAT1 on several production traits including MY has been previously demonstrated in several studies [24, 30, 31].

Table 4 Boundaries of the validated regions with significant additive effects on milk yield and the most significant SNPs within the identified regions with their associated genes in discovery and validation populations

Chromosome 5 contained 2 significant regions for MY (94.5 to 95.0 Mbp and 96.9 to 97.9 Mbp) which extended beyond the previously reported significant regions on this chromosome for milk production traits [32, 33]. Wang et al. [32] reported a significant region between 91.2 Mbp and 97.1 Mbp for milk fat percentage, with the most significant SNP located in an intron of the epidermal growth factor receptor pathway substrate 8 (EPS8) gene. This gene has also been reported by Raven et al. [33] as influencing milk yield in Australian Holstein and Jersey populations. In our study, the two aforementioned regions contained the most significant SNPs rs136816685 at 95.0 Mbp (Jersey) and rs110729080 at 97.4 Mbp (Holstein) respectively inside intronic regions of protein tyrosine phosphatase, receptor type, O (PTPRO) gene and G protein-coupled receptor, family C, group 5, member A (GPRC5A) gene. PTPRO is located 18 Kbp downstream of EPS8 so it is likely that the same QTL affecting milk production traits is responsible for detected associations in these studies. A candidate gene in the other interval, GPRC5A, is associated with signal transduction between cells and has been reported as having differential expression (up regulation / turning on) during the onset of lactation in bovine mammary tissue [34].

The region on BTA 20 for MY was located on the middle of this chromosome which was strongly suggested as having a QTL affecting milk production traits [35, 36]. A mutation in the Growth hormone receptor (GHR) gene has been suggested as underlying the QTL in this region [35].

Seventeen significant regions suggesting several genes with additive effects on CI were discovered in this study (Table 5). Chromosome 18 had the highest number of significant regions for CI but BTA 9 and 27 contained more significant associations than other chromosomes with each having about 22 % of significant SNPs. The most significant additive effect for fertility was SNP rs41996522 (−log10(P) = 6.076) located on BTA 22 which also explained the highest proportion (0.137 %) of phenotypic variance for CI by additive effects. Nevertheless, all of the individually validated SNPs were found within the region on X chromosome extending from 139.2 to 139.8 Mbp.

Table 5 Boundaries of the validated regions that are additively significant on calving interval as well as the most significant SNPs and their associated genes within these regions in discovery and validation populations

Most of the identified QTL regions for CI in this study have been previously reported in the literature. Chromosome 18 has been largely investigated in search for QTLs affecting reproduction traits in dairy breeds [3739]. Sahana et al. [39] found strongest marker associations for some direct calving traits on this chromosome on a region ranging from 55.2 to 60.0 Mbp in Danish and Swedish Holstein cattle. Their detected interval covers one of the identified QTL regions in the present study (57.1 – 58.1 Mbps). The most strongly associated SNP in this region found by these authors, which was previously reported by Cole et al. [17] as having largest effect on several traits including calving ease, was located in an intron of the sialic acid binding Ig-like lectin-5 (SIGLEC5) gene. SIGLEC5 was suggested to have a role on the initiation of parturition in Human [40], hence suggested influencing fertility in cattle [17]. Although we have not identified the same gene here, but SIGLEC5 is positioned in our reported QTL region, so it may be possible that the described intervals are harbouring the same QTL.

Hoglund et al. [41] performed a large GWAS in a population of Nordic Holsteins for eight female fertility traits and validated their results in independent populations of Nordic Reds and Jerseys. They found several significant SNP associations for number of inseminations per conception in heifers and cows (BTA 5 and 11), Nordic female fertility index and length of the interval from calving to first insemination (BTA 9) and 56-day non-return rate in cows and heifers (BTA 27) which all overlapped with QTL regions for CI identified in our study.

Studies for detecting associations on chromosome X for fertility related traits are scarce in the literature and this chromosome is generally discarded in GWA studies mainly due to the use of mixed-sex observations. All of the SNPs that were individually validated in Jerseys for CI were found on the X chromosome, so future studies including data on this chromosome are suggested to search for QTL affecting fertility traits.

Dominance effects

Of the 7 validated regions with dominance effects on MY, 3 were on BTA 26 and 1 identified on each of BTA 2, 3, 5 and 27 (Table 6). Chromosome 3 (97.9 – 98.8 Mbp) contained more significant SNPs (59 %) than other chromosomes but the identified region on BTA 5 (71.8 Mbp) encompassed the only individually validated SNP (rs110106971) with dominance effects on MY which happened to be within an intronic region of the synapsin III (SYN3) gene. The ATP/GTP binding protein like-4 (AGBL4) gene was associated with both of the most significant SNPs in Holstein discovery (rs43361287) and Jersey validation (rs43363311) populations on the identified region on BTA 3 (97.9 – 98.8 Mbp). AGBL4 was reported as a gene under positive selection in the dual purpose (milk and beef) Normande breed cattle [42]. Among the candidate genes on the 3 identified regions on BTA 26, phospholysine phosphohistidine inorganic pyrophosphate phosphatase (LHPP) gene was reported as a differentially expressed gene in mammary gland of Holstein-Friesian dairy cows affected by the polymorphism in DGAT1 [43].

Table 6 Boundaries of the validated regions with significant dominance effect on milk yield as well as the most significant SNPs and their associated genes within these regions in discovery and validation populationsa

There were 8 regions identified for CI (Table 7) suggesting some genes influencing fertility by dominance gene actions. The region on BTA 2 extended from 80.2 to 80.7 Mbp contained both of the most significant SNPs in Holsteins (rs41591067) and in Jerseys (rs133868000) within intronic regions of Myosin IB (MYO1B) gene. MYO1B associated with one of the individually validated SNPs (rs133868000) with dominance effect on CI in our study and was reported to have differential expression in in vitro culture of mouse blastocysts in suboptimal conditions [44]. Both of the top SNPs in Holstein (rs134910746) and Jersey (rs29020504) populations on the interval extended from 15.8 to 16.0 Mbp on BTA 3 were found to be inside intronic regions of potassium intermediate/small conductance calcium-activated channel, subfamily N, member 3 (KCNN3) gene. KCCN3 plays a key role in fluid secretion within the bovine oviduct which is essential to provide an appropriate environment for gamete maturation, transport, fertilization and early embryo development [45]. It has also been shown that KCNN3 is differentially expressed between oocytes and granulosa cells (GCs) during development of the sheep ovarian follicle [46]. Similarly, human orthologue of KCNN3 was reported as a gene associated with preterm birth [47].

Table 7 Boundaries of the validated regions with significant dominance effect on calving interval and the most significant SNPs with their associated genes within these regions in discovery and validation populations

Implications

One of the critical parameters determining the power of GWAS is the amount of LD between the observed SNP and the unobserved causal variant. In fact, the success of a GWAS in identifying QTLs with additive effects is controlled by r 2 (r is the correlation between genetic marker and causative mutation) while detection of dominance or pairwise additive by additive effects depends on r 4. This indicates there is a much higher reliance on LD when searching for non-additive effects compared to additive effects, if LD between the markers and QTL is incomplete [48]. This was reflected in results of the present study in which a larger number of markers with additive effects were identified than the markers with dominance and epistasis effects for both traits under investigation.

Although we validated some of the pairwise (putative epistatic) interactions for MY across breeds, a subsequent analysis that included the effect of the DGAT1 gene, which has known effect on the trait [24], removed all of the detected associations. This suggests that the identified epistatic associations are actually haplotype effects that are in higher LD with the DGAT1 mutation than the individual SNPs. This illustrates a problem with testing for epistatic interactions with common SNPs in imperfect LD with causative mutations; SNP by SNP interactions can describe haplotypes that are in higher LD with the causative mutation than the individual SNP, and are therefore significant when there is no true epistatic effect present. Putative epistatic interactions between common SNP should therefore be treated with caution.

The standard in reporting GWAS results is validation and before genotype-phenotype relationships can be used in selection decisions, they should be replicated in an independent population to confirm generalized effects in multiple populations [49]. Validation of GWAS results across breeds can refine QTL regions to narrower intervals [33] and is powerful in identifying positional candidate genes. This is because the extent of LD across cattle breeds is limited in contrast to within a breed where considerable LD can be maintained in intervals up to 1 Mbp as a result of a relatively small effective population size [50]. We validated a lower number of non-additive genetic associations than additive effects such that very few dominance effects for MY and CI were confirmed and no epistasis effect was common across Holstein and Jersey cows for CI. This trend is in agreement with the statement that the higher dependence on LD in searching for dominance and epistatic effects compared to additive effects significantly decreases the chance of validating associations in two independent populations for non-additive effects of the markers [48]. Failure to validate many associations could also be related to the genetic differences between breeds, or even populations, and the fact that many QTLs are only segregating in one breed and not in the other [33, 41]. In situations like these, the validation may not be successful in confirming a true positive that exists in one breed but not shared between breeds, even if the power for detecting associations in both populations is high. Detecting marker effects and validating them on very dense genotypes or sequence data may help to overcome these problems.

Quantifying non-additive gene actions requires phenotypes that are measured on genotyped individuals. Daughter yield deviations are performance averages typically over hundreds of daughters. However, by definition they cannot capture the dominance deviation in daughters’ phenotypes, so they are not useful for estimating non-additive effects of genes. The only way to estimate non-additive genetic effects in dairy cattle is through large datasets of dairy cows, as they express almost all of the economically important traits in dairying.

The number of cows with available genotypes in dairy cattle is less than the genotyped bulls but the trend is towards genotyping more cows. Moreover, the availability of genotyped ancestors of cows enables inferring genotype probabilities for cows that can be then used in estimation of non-additive effects [51]. Nonetheless, accurate estimation of non-additive genetic effect requires more data than the data needed for additive effects to maintain the same power in analysis [20, 48] and this is exacerbated by the necessity of having observations in all three genotype classes. We removed a large number of SNPs with minor genotypic frequency <0.01 here when included biased estimates of dominance effects were obtained because of a missing class of SNP genotype. A large number of genotyped cows with a high resolution (dense) genome content or whole genome sequence could solve this problem.

Conclusions

We identified and validated a small number of SNPs with suggested dominance effects on MY and CI in Australian Holstein and Jersey cows. Given our results, identifying non-additive gene actions using single SNP regression in a GWAS setting will require very large datasets to capture the likely very small individual non-additive genetic effects. Alternative approaches that simultaneously use all genomic information in Bayesian methods or in terms of genomic relationship matrices in a BLUP seem more appropriate. As the number of genotyped animals is steadily increasing in dairy cattle breeding, incorporating whole genome sequence could reduce the problem of high dependency on LD in detection of non-additive genetic effects and is suggested as a future approach.

Methods

Data

The original dataset contained 9,159,969 calving interval and 305 d milk yield records of 3,513,757 Australian Holstein and Jersey cows that were in lactation between 1980 and 2011 and recorded by the Australian Dairy Herd Improvement Scheme (ADHIS; Melbourne, Australia). Not all contemporaries in a given herd-year-season were genotyped, and therefore, to accurately remove contemporary group effects from the phenotypes, the records were pre-corrected for the effects of age at calving, herd-year-season, parity and month of calving using the full set of records. The residuals from this model were used as the response variable in GWAS analyses for the genotyped animals. Records that did not have both genotype and phenotype data were discarded, so that the final datasets included 23,198 and 11,091 MY and CI records respectively for 7,055 Holstein and 3,795 Jersey cows. Distribution and descriptive statistics of the Holstein and Jersey populations’ phenotypes are shown in Fig. 5.

Fig. 5
figure 5

Summary statistics of data. Distribution and summary statistics of Holstein and Jersey datasets for milk yield and calving interval

The genotyped animals were part of the Australian genomic reference population that were previously selected for typing based on completeness of their phenotypic data and having sires in the national Australian genomic reference population. Animals were genotyped with Illumina BovineSNP50 v2 BeadChip (Illumina, San Diego, CA, USA) and their 50 K SNP data were imputed to the high density (HD) 800 k panel by using another set of 1,785 animals (including Holstein and Jersey key ancestor bulls and heifers) with genotypes from the Illumina BovineHD BeadChip (Illumina, San Diego, CA, USA). The imputation was performed with BEAGLE 3.1 [52]. Quality control checks applied on both genotypic data sets prior to the imputation step were as described in Erbe et al. [53]: only SNPs with Gen-Call score > 0.6 and call rate > 90 % were kept; mitochondrial, unmapped and duplicate map position SNPs were removed; and a minimum number of 10 copies for the minor allele was required for each SNP to be included in the data set. This resulted in 45,754 and 632,003 SNPs for the 50 K and 800 K panels, respectively. A further 223,748 SNPs were removed from HD SNP panel owing to a genotype class had a frequency <0.01 in both Holstein or Jersey animals. This was done because the scope of this study focused on both additive and dominance effects of the SNPs and required enough observations in all three classes of SNP genotypes. Finally, the remaining 408,255 SNPs were used for the GWAS. The positions of SNPs on the genome were based on UMD 3.1 genome assembly [54] and the coding of SNP genotypes were 0, 1, and 2 respectively for aa, Aa and AA allele combinations. More details on filtering and imputation of SNPs can be found in Erbe et al. [53] and Hayes et al. [55].

A total of 408,255 SNPs included in the analyses were distributed over 29 BTA as well as the X chromosome, but the distribution was uneven with an average genome-wide distance of 6.52 Kb between SNPs, and a minimum and maximum average distance of 5.40 Kb and 15.37 Kb on BTA 25 and on X chromosome, respectively.

Statistical model

A mixed linear model was implemented to test for associations between genetic markers and phenotypic values of both traits (CI and MY) for each breed separately. SNP effects were calculated one at a time using single SNP regression procedure. The model in matrix notation was:

$$ \mathbf{y}={1}_{\mathbf{n}}\mu +\mathbf{X}\mathbf{b}+\mathbf{Z}\mathbf{u}+\mathbf{W}\mathbf{p}\mathbf{e}+\mathbf{e} $$
(1)

where y is a vector of phenotypes (CI or MY), 1 n is a vector of ones, μ is the population mean term, b is the vector containing relevant additive, dominance or epistatic marker effects as specified below, u contains polygenic effects assumed to be distributed as u ~ N(0, A σ 2 g ) with A being the pedigree based numerator relationship matrix, pe is the vector of random permanent environmental effects with pe ~ N(0, I σ 2 pe ) and e is a vector of random residual deviates supposed to be distributed as e ~ N(0, I σ 2 e ). X is a design matrix allocating records to markers effects (additive, dominance or epistatic) and Z and W are incidence matrices for the random effects. σ 2 g , σ 2 pe and σ 2 e are polygenic additive, permanent environmental and residual variances, respectively. The pedigree files included 29,042 and 15,977 animals for Holstein and Jersey with corresponding average generations of seven and six.

The original marker covariates (0, 1 or 2) were corrected for allele frequencies [13] to build X, so that x ij(a) = {−2p, (q − por 2q} for additive effects of aa, Aa and AA genotypes, respectively, with p and q being the allele frequency of A and a allele at marker j in the population. For dominance effects, aa, Aa and AA genotypes were coded as x ij(d) = {−2p 2, 2pq and − 2q 2}. Then, the contents of Xb varied with the type of the genetic effect being tested. For additive effects, Xb = {x ij(a) a j }, where a j is the corresponding additive effect. For dominance, Xb = {x ij(a) a j  + x ij(d) d j }, where d j is the dominance effect of marker j. In the epistasis model, \( \mathbf{X}\mathbf{b}=\left\{{x}_{ij(a)}{a}_j+{x}_{ij{}^{\prime }(a)}{a}_{j{}^{\prime }}+{x}_{ijj{}^{\prime }(e)}{a}_{jj{}^{\prime }}\right\}, \) where \( {x}_{ij{j}^{\prime }(e)} \) is the qualification for nested interaction effects involving markers j and j’, \( {a}_{j^{\prime }} \) is the corresponding additive effect for the j’ marker and \( {a}_{j{j}^{\prime }} \) is the pairwise additive by additive epistatic marker effect between markers j and j’. The models were fitted to the data with ASReml v3.0 [56].

Validation

Cross validation

To discriminate between true discoveries and false positive SNP effects, and also to confirm significant SNPs that were consistent between breeds, results from the Holstein and Jersey data sets were compared. The larger Holstein population was assigned to the discovery set and results from Holstein analyses were validated in the Jersey breed in two different ways. First, if a significant SNP was found in the discovery process, we examined whether it was also significant in the validation population. Second, for each significant SNP in the discovery population, we searched for significant SNPs in the validation population within the region 500 kb downstream and upstream of the identified SNP. We call the latter segment validation which accounts for the possible difference in allele frequency of markers in LD with the causal mutation in the different population, that may cause different markers in the same region are tagging the same causal mutation in different populations. The window size was chosen by considering the long range of strong LD between SNPs in the Holstein and Jersey breeds [50, 57].

False Discovery Rate (FDR)

The false discovery rate was calculated following the approach proposed by Bolormaa et al. [58].

$$ \%FDR=\frac{P\left(1-\frac{S}{T}\right)}{\left(\frac{S}{T}\right)\left(1-P\right)}\times 100 $$
(2)

where P is the P-value threshold in F-test, S is the number of significant SNPs according to this threshold and T is the total number of tests. This FDR can be used to provide an expectation of the number of true positives at the probability thresholds.

The P-value level used to decide on significance of a main or interaction effect was the corresponding Bonferroni correction of the nominal P-value of 0.05 as: \( \frac{0.05}{Number\ of\ tests}, \) so the adjusted P-value threshold in the analysis including all SNPs was equal to P < 1 × 10−7. In the case that no significant SNP effects were identified at this threshold, an arbitrary P-value threshold of P < 0.0001 in the discovery dataset (i.e. Holsteins) and P < 0.01 in the validation population (i.e. Jerseys) was used. Although this might be arbitrary and some false positives may arise using this approach, it is highly unlikely for significant effects to arise simultaneously in two completely independent populations and our ultimate standard is across breed validation.

For additive and dominance models, all of the markers in the final HD panel were used. To reduce the dimension of SNP combinations tested in the epistatic models, only significant SNPs determined using the P-value of the F-test of the additive model in the Holstein discovery set were included. Therefore, n(n-1)/2 pairwise interactions were tested where n is the number of SNPs whose main additive effect is significant.

The additive (σ 2 a ) or dominance (σ 2 d ) variance of each SNP was estimated as follows:

$$ {\sigma}_{a_i}^2=2{p}_i{q}_i{a}_i^2 $$
(3)
$$ {\sigma}_{d_i}^2=4{p}_i^2{q}_i^2{d}_i^2 $$
(4)

where p i and q i are the frequency of A and a alleles, and a i and d i are respectively the estimated additive and dominance effect of the ith SNP obtained from model (1) and model (2). For most strongly associated additive or dominance SNP effects, these variances were expressed as a fraction of total phenotypic variance (σ 2 p  = σ 2 g  + σ 2 pe  + σ 2 e ) where each variance component was estimated based on model (1) without fitting SNP covariates.

Manhattan plots were created using qqman 0.1.1 [59] in R 3.1.0 (R Development Core, 2014) and the National Center for Biotechnology Information’s databases (www.ncbi.nlm.nih.gov) were used to determine if the significant SNPs were positioned inside known genes. Animal genome QTL database (AnimalQTLdb: http://www.animalgenome.org/cgi-bin/QTLdb/index) was used to compare the results of present study with the reported QTL regions in the literature.