Recently, many countries have incorporated genomic information, in the form of thousands of single nucleotide polymorphism (SNP) genotypes obtained with microarray technology, into their genetic evaluation systems (Hayes et al. 2009; VanRaden, 2008). Genomic information has thus become an important part of the routine evaluation of genetic merit in dairy cattle (Liu, 2010). In this paper we describe the results of fitting and validating a genomic selection model for the population of Polish Holstein-Friesian dairy cattle.

The training data set used for the estimation of additive SNP effects consisted of 1227 Polish Holstein-Friesian bulls. Bulls were selected for genotyping based on two major criteria: the accuracy of their conventionally estimated breeding values and their representativeness, in terms of genetic merit, of the population of all dairy bulls active in Poland. The first criterion was quantified through the effective daughter contribution (EDC) associated with the estimated breeding value (EBV) for milk yield of each bull. Traits were represented by EBVs, which were deregressed using the method of Jairath et al. (1998) based on the national proofs from the February 2010 release. Altogether 29 traits were considered: three production traits and somatic cell score, originating from a random regression test day model, as well as four female fertility traits and 21 type and conformation traits, originating from an animal model. The traits are listed in online resource 1. Genotypes were generated using the Illumina BovineSNP50 Genotyping BeadChip, which consists of 54 001 SNPs. SNPs were selected on polymorphism, expressed by a minimum minor allele frequency (MAF) of 0.01, and on technical quality, expressed by a minimum call rate of 90% within the analyzed sample of bulls. The average call rate in our data was high, amounting to 99.66% for all SNPs and 99.75% for the selected SNPs. For the estimation of direct genomic values (DGV), 46 267 SNPs were selected, yielding 56 502 470 bull-SNP genotypes in total for milk yield. For the other traits the total number of bull-SNP genotypes was lower, since not all of the genotyped bulls had EBVs available.
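The two SNP selection criteria described above can be illustrated with a short sketch. It is not the software used in the study: the function name `select_snps`, the use of NaN for missing calls, and the treatment of both thresholds as inclusive are assumptions made for illustration. Genotypes are coded −1, 0, 1, as in the model described below.

```python
import numpy as np

def select_snps(genotypes, min_maf=0.01, min_call_rate=0.90):
    """Return a boolean mask of SNPs passing the MAF and call-rate filters.

    genotypes: (n_bulls, n_snps) array coded -1/0/1, with np.nan marking
    a missing call (assumed convention, not taken from the paper).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(genotypes)
    n_called = called.sum(axis=0)
    call_rate = called.mean(axis=0)

    # With -1/0/1 coding, (genotype + 1) counts copies of one allele (0, 1 or 2),
    # so its frequency among called genotypes is (sum + n_called) / (2 * n_called).
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = (np.nansum(genotypes, axis=0) + n_called) / (2.0 * n_called)
    maf = np.minimum(freq, 1.0 - freq)

    # A SNP with no calls at all has NaN frequency and fails both comparisons.
    return (call_rate >= min_call_rate) & (maf >= min_maf)
```

Applying the mask as `genotypes[:, select_snps(genotypes)]` would keep only the columns used for SNP effect estimation.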

The following mixed model was used to estimate the additive effects of the selected Nsnp = 46 267 SNPs for up to Na = 1227 bulls with genotypes: \( \mathbf{y} = \mathbf{Xb} + \mathbf{Zg} + \mathbf{e} \), where y is a [Na] vector of deregressed EBVs (dEBVs), X is a [Na × Nb] design matrix for fixed effects, b is a [Nb] vector of fixed effects, which in the current model comprise only a general mean (Nb = 1), Z is a [Na × Nsnp] design matrix for SNP genotypes, which is parameterized as −1, 0, or 1 for a homozygous, a heterozygous, and an alternative homozygous SNP genotype, respectively, g is a [Nsnp] vector of random additive SNP effects, and e is a [Na] vector of residuals with \( \mathbf{e} \sim N\left( 0, \mathbf{D}\widehat{\sigma}_e^2 \right) \), D being a diagonal matrix containing the reciprocal of EDC on the diagonal. The covariance structure of g was assumed to be \( \mathbf{g} \sim N\left( 0, \mathbf{I}\frac{\widehat{\sigma}_a^2}{N_{snp}} \right) \), with I being an identity matrix and \( \widehat{\sigma}_a^2 \) representing the additive genetic variance of a given trait. The estimation of the parameters of the above model was based on solving the mixed model equations: \( \left[ \begin{array}{c} \widehat{\mathbf{b}} \\ \widehat{\mathbf{g}} \end{array} \right] = \left[ \begin{array}{cc} \mathbf{X}^T\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}^T\mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}^T\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}^T\mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1} \end{array} \right]^{-1} \left[ \begin{array}{c} \mathbf{X}^T\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}^T\mathbf{R}^{-1}\mathbf{y} \end{array} \right] \) (Henderson, 1984), with R represented by \( \mathbf{D}\widehat{\sigma}_e^2 \) and G represented by \( \mathbf{I}\frac{\widehat{\sigma}_a^2}{N_{snp}} \).
The mixed model equations were solved by iteration on data using the Gauss-Seidel algorithm with residual update (Legarra and Misztal, 2008). Under this model, the variance of y is given by \( \mathbf{ZGZ}^T + \mathbf{R} \). Note that the additive genetic variance component \( \widehat{\sigma}_a^2 \) of this model was not estimated but was assumed to be known, based on the estimates used in the Polish national genetic evaluation model for the corresponding trait.
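The solving scheme can be sketched as follows. This is a minimal dense NumPy illustration under stated assumptions, not the program used in the study: the function name `snp_blup_gsru`, its arguments, and the convergence criterion are hypothetical, and the residual variance is passed in as a known value. The equations are multiplied through by \( \widehat{\sigma}_e^2 \), so that each record is weighted by its EDC and each SNP effect is shrunk by \( \lambda = \widehat{\sigma}_e^2 N_{snp} / \widehat{\sigma}_a^2 \).

```python
import numpy as np

def snp_blup_gsru(y, Z, edc, var_a, var_e, n_iter=500, tol=1e-12):
    """Gauss-Seidel with residual update for y = 1*mu + Z g + e.

    y: dEBVs, Z: (n_bulls, n_snps) genotypes coded -1/0/1,
    edc: weights (reciprocals of the diagonal of D).
    Returns the general mean and the vector of SNP effect estimates.
    """
    y = np.asarray(y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    w = np.asarray(edc, dtype=float)
    n_snps = Z.shape[1]
    lam = var_e * n_snps / var_a                # shrinkage added to each SNP diagonal
    zwz = (Z ** 2 * w[:, None]).sum(axis=0)     # diagonal of Z' W Z

    mu, g = 0.0, np.zeros(n_snps)
    e = y.copy()                                # residuals for the start values mu = 0, g = 0

    for _ in range(n_iter):
        mu_old, g_old = mu, g.copy()

        # General mean: right-hand side is rebuilt from the current residuals.
        mu_new = (w @ e) / w.sum() + mu
        e -= mu_new - mu
        mu = mu_new

        # SNP effects, one at a time, updating the residuals after each solve.
        for j in range(n_snps):
            zj = Z[:, j]
            gj_new = (zj @ (w * e) + zwz[j] * g[j]) / (zwz[j] + lam)
            e -= zj * (gj_new - g[j])
            g[j] = gj_new

        # Relative squared change of the solutions (assumed stopping rule).
        change = (mu - mu_old) ** 2 + np.sum((g - g_old) ** 2)
        size = mu ** 2 + np.sum(g ** 2)
        if size > 0 and change / size < tol:
            break
    return mu, g
```

Because the residual vector is kept current, each update needs only one pass over a single genotype column, which is why this scheme suits tens of thousands of SNPs without forming the coefficient matrix explicitly.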

DGV is defined as the sum of additive effects of SNPs estimated from the above model: \( \widehat{\mathbf{a}} = \mathbf{X}\widehat{\mathbf{b}} + \mathbf{Z}\widehat{\mathbf{g}} \). The genomic enhanced breeding values (GEBV) were calculated as a combination of the genomic information coming through DGV and the parental information coming through the parent average (PA), using a selection index approach: \( GEBV = \left[ \begin{array}{cc} REL_{DGV} & REL_{PA} \end{array} \right] \left[ \begin{array}{cc} REL_{DGV} & REL_{DGV}REL_{PA} \\ REL_{DGV}REL_{PA} & REL_{PA} \end{array} \right]^{-1} \left[ \begin{array}{c} DGV \\ PA \end{array} \right] \), where \( REL_{DGV} \) is the reliability of an individual's DGV, calculated as explained below, and \( REL_{PA} \) is the individual's PA reliability originating from the national genetic evaluation. The reliability of DGV was estimated following the approach of Strandén and Garrick (2009), based on the following model: \( \mathbf{y} = \mathbf{Xb} + \mathbf{Z}^*\mathbf{a} + \mathbf{e} \), where \( \mathbf{Z}^* \) is a design matrix for DGV and a is a [Na] vector of random direct genomic value effects for bulls, distributed as \( \mathbf{a} \sim N\left( 0, \mathbf{A}_g\widehat{\sigma}_a^2 \right) \), with \( \mathbf{A}_g \) defined as \( \mathbf{Z}\mathbf{Z}^T\frac{1}{p_{het}^b} \), where \( p_{het}^b \) is the sum over all SNPs of heterozygous genotype frequencies in the base population, estimated following VanRaden (2008).
The reliabilities of bulls' DGVs are given by: \( \mathbf{REL}_{DGV} = diag\left\{ \left( \mathbf{A}_g - \frac{\widehat{\sigma}_e^2}{\widehat{\sigma}_a^2}\mathbf{C}^{22} \right)\mathbf{A}_g^{-1} \right\} \), where \( \mathbf{C}^{22} \) is the block corresponding to DGV in the inverse of the coefficient matrix of the mixed model equations (MME): \( \left[ \begin{array}{cc} \mathbf{X}^T\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}^T\mathbf{R}^{-1}\mathbf{Z}^* \\ \mathbf{Z}^{*T}\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}^{*T}\mathbf{R}^{-1}\mathbf{Z}^* + \mathbf{A}_g^{-1}\frac{\widehat{\sigma}_e^2}{\widehat{\sigma}_a^2} \end{array} \right]^{-1} = \left[ \begin{array}{cc} \mathbf{C}^{11} & \mathbf{C}^{12} \\ \mathbf{C}^{21} & \mathbf{C}^{22} \end{array} \right] \).
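The selection index for GEBV given above can be evaluated directly for a single animal. The sketch below is only an illustration of that formula as written: the function name `gebv_index` and the numerical values in the usage line are hypothetical. Expanding the 2 × 2 inverse shows that the index weights reduce to \( (1 - REL_{PA})/(1 - REL_{DGV}REL_{PA}) \) for DGV and \( (1 - REL_{DGV})/(1 - REL_{DGV}REL_{PA}) \) for PA.

```python
import numpy as np

def gebv_index(dgv, pa, rel_dgv, rel_pa):
    """Combine DGV and PA for one animal with the selection index from the text."""
    cov = np.array([rel_dgv, rel_pa])
    var = np.array([[rel_dgv, rel_dgv * rel_pa],
                    [rel_dgv * rel_pa, rel_pa]])
    # var is symmetric, so solving var * weights = cov gives cov' * var^{-1}.
    # The system is singular if a reliability is 0 or both are equal to 1.
    weights = np.linalg.solve(var, cov)
    return float(weights @ np.array([dgv, pa]))

# Hypothetical values: DGV = 100, PA = 50, REL_DGV = 0.5, REL_PA = 0.3
example_gebv = gebv_index(100.0, 50.0, 0.5, 0.3)
```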

Descriptive statistics for the analyzed traits are summarized in online resource 1, which shows that for each trait DGV had a similar but somewhat lower standard deviation than EBV. This was expected, since deregressed EBVs were used as the dependent variable in the SNP effect estimation model. For the training data set, the estimated correlations between EBV and DGV were very high, ranging from 0.98 for milk yield to 0.78 and 0.81 for non-return rates at 56 days for cows and heifers, respectively, which are the traits with the lowest heritability (0.02).

The highest positive correlations between SNP estimates were observed between interval from calving to first insemination and days open (0.89), between size and stature (0.80), and between milk and protein yields (0.76). The strongest negative correlations were between overall feet and leg score and rear leg set (−0.35), between body depth and udder depth (−0.26), and between rear leg rear view and rear leg set (−0.21). Most of the values (except the correlation between body depth and udder depth) correspond well with the estimates obtained for the Polish Holstein-Friesian breed based on conventional, multivariate models (Żarnecki et al., 2003). Manhattan plots of SNP effect estimates for milk and fat yields along the genome are presented in online resource 2. In order to enable comparison of SNP effects, their estimates were transformed to a standard normal distribution and are presented as absolute values. The highest SNP estimate for milk yield amounted to 3.67 kg, for fat yield 0.20 kg, and 0.0002 day for non return rate at 56 days of heifers. The main goal of genetic evaluation is not to identify particular loci with considerable effects on a trait, but to assess the sum of all possible additive effects across the genome. However, from the geneticist's perspective, a closer examination of the effects of particular SNPs and their links to bovine genomic features is of great interest. Estimates of SNP effects on milk and fat yields on BTA14 in the proximity of DGAT1, a gene with a very strong effect on both traits (Grisart et al., 2002), are shown in online resource 3. Our results confirm that the DGAT1 locus has a large effect on milk and fat yields and provide empirical evidence of the validity of the SNP effect estimation procedure.

In order to formally validate the genomic selection model, the procedure recommended by Interbull (Mäntysaari et al. 2010) was followed. For this purpose the original training data set was partitioned into an estimation data set consisting of older bulls and a validation data set consisting of younger bulls. The validation data set consisted of 232 bulls, while the remaining 984 bulls were used for the estimation of SNP effects. Validation was done for milk, fat and protein yields. The linear regression coefficients for the regression of dEBV on PA and on GEBV for the three traits are summarised in online resource 4. In general, models involving PA had much lower slopes than models using GEBV as the independent variable, indicating that the latter models had better predictive ability (Fig. 1). The best prediction, indicated by a slope of 0.96, which is closest to the expected value of 1.00, was obtained for the regression of dEBV on GEBV for milk yield, and the worst, with a slope of 0.26, for the regression of dEBV on PA for fat yield. The correlations with EBV (Table 1) were lowest for PA (from 0.14 to 0.26), intermediate for DGV (from 0.32 to 0.38), and generally highest when both sources of information were combined into GEBV (from 0.31 to 0.43). One exception was fat yield, for which the highest correlation was obtained using DGV.
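The slope used as the validation criterion above is that of a simple linear regression of dEBV on the predictor (PA or GEBV). A minimal sketch is given below. The function name `validation_regression` is hypothetical, and the optional record weights are an assumption, since the text does not state whether the regression was weighted.

```python
import numpy as np

def validation_regression(debv, predictor, weights=None):
    """Regress dEBV on a predictor (PA, DGV or GEBV) for the validation bulls.

    Returns (slope, intercept, r_squared). A slope close to 1 indicates
    that the predictor is neither inflated nor deflated.
    """
    y = np.asarray(debv, dtype=float)
    x = np.asarray(predictor, dtype=float)
    # Equal weights unless record weights (e.g. EDC, an assumption) are supplied.
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)

    x_mean = np.average(x, weights=w)
    y_mean = np.average(y, weights=w)
    sxx = np.sum(w * (x - x_mean) ** 2)
    syy = np.sum(w * (y - y_mean) ** 2)
    sxy = np.sum(w * (x - x_mean) * (y - y_mean))

    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared
```

Calling the function once with PA and once with GEBV as the predictor gives the pairs of slopes compared in the text.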

Fig. 1 Predictive ability for PA and GEBV expressed as a linear regression for 232 bulls from the validation data set

Table 1 Pearson correlation coefficients between EBV from 2010 and PA/DGV/GEBV together with the reliability of DGV and GEBV, calculated based on daughter information from 2004 for the validation data set. Nv is the number of bulls in the validation data set

Many simulated as well as real data sets have been analysed in order to compare the predictive ability of various models used for the estimation of SNP effects (Clark et al. 2010; Konstantinov and Hayes 2010; Mrode et al. 2010; Shepherd et al. 2010). Summarising the results of those studies, one can conclude that no marked differences in predictive ability can be observed between models. Instead, factors related to the genetic background of the trait (heritability, number of loci with large effects) as well as the structure of the training data set play a key role in determining correlations between the predicted and true genetic merits (Calus, 2010). The results obtained in our study clearly show that a much better accuracy of prediction for selection candidates can be achieved by combining information from SNP genotypes (through DGV) and parental EBVs (through PA) than by the conventional approach based entirely on the EBVs of ancestors.

In our study a low reliability of DGV was obtained for the young selection candidates. It is much lower than the values reported for production traits by Hayes et al. (2009), Lund and Su (2009), and VanRaden et al. (2009), which vary between 0.45 and 0.73. The main reason for the low values obtained in our study was, as indicated by Hayes et al. (2009) and Habier et al. (2010), a relatively small training data set and the correspondingly low genetic relatedness between the training and the selection candidate data sets (only 59% of bulls from the validation data set had sires in the training data set). Still, the obtained accuracy of DGV and GEBV was much higher than the accuracy of PA. Moreover, based on the results for protein yield, the predictive ability of the genomic model described here was positively validated by the International Bull Evaluation Service (Interbull 2010) in August 2010. Consequently, the model presented in this study has been recognised within European Union states by the Directorate of Animal Health and Welfare of the European Commission as a valid procedure for genomic evaluation.