Background

Genomic selection was proposed in 2001 [1] and has since then become a major research topic in animal breeding. Accuracy of genomic prediction depends greatly on the size of the reference population [2, 3]. The larger the reference population, the more accurate genomic prediction is. It was reported that reliabilities of genomic prediction of Holstein cattle increased when Holstein cattle of other countries were added to the reference dataset [46]. Similarly, pooling genotypes from other countries or populations to form a common reference population helped to increase the reliability of predictions in Brown Swiss cattle [6, 7]. In addition, reliabilities of genomic prediction obtained by combining the reference populations of Danish, Swedish and Finnish red cattle were higher than those using single-country reference populations [8]. Holstein dairy cattle in China were originally imported from Europe and North America and were mostly derived from cross-breeding between the local yellow cattle and imported foreign Holstein bulls. It is assumed that the current Chinese Holstein population is genetically close to the other Holstein populations in the world. To date, the reference population of genotyped Holstein cattle in China is relatively small and includes mainly cows. It is expected that a joint reference dataset that combines Chinese Holstein cattle and Holstein cattle from other populations will greatly improve the reliability of genomic predictions of the Chinese Holstein population, assuming linkage disequilibrium between markers and quantitative trait loci (QTL) is consistent between the populations.

The objectives of this study were to: (1) estimate the consistency of linkage disequilibrium between the Chinese and the Nordic Holstein populations and (2) assess the gains in reliability of genomic predictions in Chinese Holstein from using a joint Chinese and Nordic reference dataset, compared with using the Chinese reference dataset alone.

Methods

Data

In this study, both the Chinese Holstein (CH) and Nordic Holstein (NH) cattle were genotyped with the Illumina BovineSNP50 BeadChip (Illumina, San Diego, CA). The single nucleotide polymorphism (SNP) data of each population were edited separately by deleting SNP with minor allele frequencies less than 0.01 or call rates less than 0.10, and excluding individuals with more than 10% missing marker genotypes. After editing, 41 838 SNP on 29 autosomes were retained in both populations. The genotyped CH cattle included 80 bulls born between 1993 and 2002 and 2091 cows born between 2001 and 2006, which were daughters of 13 of the genotyped bulls. The number of daughters per bull ranged from 63 to 358, with a mean of 135. The genotyped NH cattle included 5216 bulls born between 1974 and 2008. All animals of both populations were used in the linkage disequilibrium analysis. Deregressed proofs (DRP) were used as phenotypes for genomic prediction. DRP of CH bulls and cows were derived from the estimated breeding values (EBV) obtained from the Chinese genetic evaluations in April 2012 (Dairy Association of China), and DRP of NH bulls were derived from the EBV of Nordic genetic evaluations in November 2010 (Nordic Genetic Evaluation). Three traits (milk yield, fat yield and protein yield) were analyzed. In total, 4398 NH bulls and all CH animals had phenotypes for the three traits. 512 CH cows with possible incorrect sire information were discarded based on parentage verification with 255 SNP performed in a previous study [9] in which paternity was considered incorrect if five or more SNP were in conflict (i.e., a sire was homozygous for one allele but its daughter was homozygous for the other allele). Consequently, 1572 CH cows and 80 CH bulls with reliable pedigree information were used for genomic prediction.

Measure of linkage disequilibrium (LD) and consistency of LD

Each chromosome was phased separately using Beagle [10] for each population, and all missing genotypes were simultaneously imputed by Beagle. Linkage disequilibrium between a pair of SNP was measured as r LD 2 [11], and r LD was calculated as follows:

r LD = f AB - f A f B f A f a f B f b ,

where f(AB), f(A), f(a), f(B) and f(b) are observed frequencies of haplotype AB, alleles A, a, B and b, respectively. Maternal and paternal haplotypes were pooled to calculate LD. The consistency of LD in the two populations was measured by the correlation of r LD of adjacent marker pairs on each autosome between the two populations [12].

Prediction of genomic breeding values

A single-trait GBLUP model was used for genomic prediction based on the Chinese reference dataset, and a two-trait GBLUP was used for genomic prediction based on the joint reference dataset that included both the CH and NH cattle. In the latter, a single biological trait was regarded as a different trait in the two populations. The basic GBLUP model [13, 14] was

y = 1 μ + Zg + e ,

where y is the vector of phenotypes, μ is the population mean, g is the vector of breeding values, e is the vector of residuals, and Z is a design matrix allocating g to y. It was assumed that g ~ N 0 , G σ g 2 and e ~ N 0 , D σ e 2 , where G is the genomic relationship matrix, σ g 2 is the additive genetic variance, D is a diagonal matrix with weights of the residual variance [15], and σ e 2 is the residual variance. The G matrix was constructed according to the method (method 1) described by VanRaden [14], i.e. G = MM '/∑2p i (1 - p i ), where elements in column i of M are 0 - 2pi, 1 - 2pi and 2 - 2pi for genotypes A1A1, A1A2 and A2A2, respectively, and pi is the allele frequency of A2 at locus i, calculated from the available marker data. Variances and covariances in the GBLUP models were estimated using the “average information” restricted maximum likelihood algorithm, and the GBLUP analyses were conducted using the DMU Package [16].

Validation

Accuracy of genomic predictions using the CH and the joint reference datasets was assessed by two validation procedures. In the first procedure, the reference dataset comprised 13 genotyped CH bulls and their genotyped daughters and the test dataset consisted of 48 CH bulls without genotyped daughters. The remaining 19 CH bulls, which were highly related with the 13 reference bulls, were not used in the test dataset. In the second procedure, a five-fold cross-validation procedure was used for genomic prediction of CH cows. In each fold of cross-validation, two or three half-sib families of cows were used as the test dataset and the remaining cows and genotyped CH bulls were used as reference dataset. The numbers of animals in the reference and test datasets for the two validation procedures are shown in Table 1. Genomic EBV (GEBV) for the animals in a test dataset were predicted using both the CH and the joint reference datasets. The joint reference dataset included all 4398 Nordic bulls.

Table 1 The size of test and reference datasets used for validating genomic predictions of Chinese Holstein (CH) bulls and cows

Reliabilities of GEBV for bulls and cows were measured as the squared correlation of GEBV and DRP divided by the average reliabilities of the DRP in the test dataset (Cor2(GEBV,DRP)/r2DRP) [15]. Because CH bulls were born between 1993 and 2002, genetic trend due to selection could inflate the correlation between GEBV and DRP. Therefore, in the validation of genomic predictions for CH bulls, genetic trends present in GEBV and DRP were corrected by regressing on birth year, and then the reliabilities were calculated by correlating the corrected GEBV and DRP. In the validation of genomic predictions for cows, reliabilities were calculated based on GEBV pooled over the five test datasets.

Results and discussion

Linkage disequilibrium and consistency of LD

Data on the 41 838 SNP distributed over the 29 bovine autosomes are summarized in Table 2. The average distance between adjacent SNP was 59.59 Kb, and the shortest and longest gaps were 3 bp (on BTA7) and 3820 Kb (on BTA10). As shown in Figure 1, most distances between adjacent markers were less than 100 Kb. However, 2.55% adjacent SNP had large distance gaps (larger than 200 Kb). This indicates that SNP used in the current study were not evenly distributed over the bovine chromosomes.

Table 2 Distance and linkage disequilibrium ( r LD 2 ) of adjacent SNP for each Bos taurus autosome (BTA)
Figure 1
figure 1

Frequency distribution of distances between adjacent SNP pairs.

Figure 2 presents the distribution of r LD 2 of adjacent SNP pairs in the CH and NH populations. The LD between adjacent SNP pairs was generally small. The proportion of adjacent SNP pairs which had an r LD 2 < 0.01 was 15.2% for CH and 17.6% for NH; the proportion with an r LD 2 < 0.1 was 52.81% for CH and 51.72% for NH. These results indicate that the SNP markers did not effectively capture all the QTL affecting a trait, since most adjacent SNP pairs were in weak LD. The mean r LD 2 of adjacent SNP pairs within a chromosome ranged from 0.16 to 0.24 in CH and from 0.16 to 0.25 in NH. The mean r LD 2 across all chromosomes was 0.20 in CH and 0.21 in NH. The consistency of LD between the CH and NH populations was high (Table 2). The correlation of r LD of adjacent SNP ranged from 0.95 to 0.98 across all chromosomes, with a mean of 0.97 at an average marker distance of 60 Kb.

Figure 2
figure 2

Frequency distribution of r LD 2 for adjacent SNP pairs in Chinese Holstein and Nordic Holstein populations.

The mean r LD 2 in the CH and NH populations was similar to the degree of LD reported for the Holstein populations in Germany [17], in the Netherlands, Australia and New Zealand [12], and in North America [18]. The consistency of LD between the CH and NH populations agreed with the consistency of LD between the Dutch and Australian Holstein bulls reported in [12] and was in line with the development of the CH population. The first dairy cattle imported in China came from Europe in the 1870’s [19] and since then, Holstein cattle have continuously been imported from Europe, Japan and North America. The imported Holstein bulls were crossed with local yellow cattle, and the crossbred cows were continuously back-crossed with the imported Holstein bulls [20]. The resulting crossed black and white dairy cattle were officially named Chinese Holsteins in 1992. Currently, most of the Holstein bulls found in China were imported from worldwide in the form of embryos or live cattle. Besides, the NH population has also exchanged genetic material with the United States, the Netherlands, Germany and other countries. The genomic relationship matrix showed that some CH bulls could be full-sibs or half-sibs of the NH bulls. Based on these data, it can be inferred that the Chinese Holstein population is genetically close to European and North American Holstein cattle.

Genomic prediction

Reliabilities of GEBV for the 48 CH test bulls that had no genotyped daughters and the 1572 CH cows in the test dataset with the five-fold cross-validation are presented in Table 3. When the CH cattle were used as the reference data, the reliabilities of GEBV were 0.16 for CH test bulls and 0.14 for CH cows, averaged over the three traits. When using the joint reference dataset, the average reliabilities increased to 0.45 for CH bulls and to 0.21 for CH cows. Based on data from the Dairy Association of China, the reliability of a conventional pedigree index (calculated as 0.5*sire EBV + 0.25*maternal grandsire EBV) in the CH population was 0.12 for milk production traits. Thus, the reliabilities of genomic predictions based on the CH reference population were of similar magnitude as the reliabilities of a pedigree index, and the genomic predictions based on the joint reference population gave much higher reliabilities than those based on a pedigree index.

Table 3 Reliabilities of GEBV of Chinese Holstein (CH) bulls and cows in the test populations when using the CH or the joint reference population

The low reliabilities of genomic predictions using the CH reference dataset alone suggested that the information in the CH reference dataset, which mainly contained cows, was insufficient. According to Goddard [2], the reliability of GEBV for different sizes of reference population and different heritabilities of traits can be predicted as:

E r GEBV 2 = 1 - λ 2 N a × log 1 + a + 2 a 1 + a - 2 a ,

where N is the number of individuals in the reference population and a = 1 + 2λ/N. According to Hayes et al. [21], λ = M e k/h2, M e = 2N e L, and k = 1/log(2Ne), where N e is the effective population size and L is the length of the genome in Morgans. Using the above formula, the reliability of GEBV based on the CH reference dataset was expected to be 0.175, assuming L = 30, Ne = 100, N = 1500 and r2 DRP  = 0.50. The reliabilities obtained from the validation procedures were consistent with these expected reliabilities. The results indicate that the size of the CH reference population needs to be increased in order to increase the reliability of genomic predictions.

Dairy cattle reference populations usually comprise progeny-tested bulls to maximize the information from each genotyped individual. In some countries or in some cattle populations where the number of progeny-tested bulls is small, one solution is to include cows in the reference population. In order to evaluate the value of adding cows to the reference dataset, an additional analysis was performed using a CH reference dataset from which 50% of the cows were deleted. The reliabilities of GEBV for the CH bulls using the reduced CH reference dataset decreased to 0.09, 0.03 and 0.05 for milk yield, fat yield and protein yield, respectively. This indirectly demonstrates that it is feasible to use cows as reference animals for genomic prediction, when the number of available progeny-tested bulls is not sufficient. A simulation study by Mc Hugh et al. [22] also suggested that genomic information from cows could greatly increase genetic gain and accuracy of male selection. To increase the size of the cow reference population at low cost, a good alternative would be to genotype cows using a low density chip like the Bovine LD (7 K) and then impute the genotypes for the 54 K panel.

The joint reference dataset greatly improved the reliability of genomic predictions for the CH cattle. The reliabilities of GEBV for CH bulls based on the joint reference dataset were close to those for NH bulls based on the Nordic reference data [23]. Several studies have reported that the reliability of genomic prediction can be increased by using a joint reference dataset that includes reference animals from other populations. The reliabilities of GEBV increased by 10% on average when four European Holstein populations were combined into a reference dataset, compared to when only one national population was used as the reference population [5]. Reliabilities of genomic prediction for Canadian Holstein bulls increased by 6% on average when about 3000 foreign bulls were included in the reference dataset [4], and by 7% when all North American sires were included [24]. Reliabilities were 2.6% higher for Holstein and 3.2% higher for Brown Swiss cattle when 3593 foreign Holstein and 732 foreign Brown Swiss animals were included in the reference dataset of the USA domestic prediction [6].

With the joint reference dataset, reliabilities of genomic predictions improved more for CH bulls than for CH cows i.e. by 2.3 fold for CH bulls and only by 1.7 fold for CH cows. This is due to the fact that CH bulls have a closer relationship with NH bulls than the CH cows do. Among the 48 CH test bulls, 14 bulls had a genomic relationship with one or more NH bull in the range from 0.45 to 0.56. However, no CH cow had this level of relationship with any NH bull. Moreover, among the 48 CH test bulls, 33 (68.75%) had a genomic relationship greater than 0.2 with at least one NH bull (with 15.5 bulls on average). Among the 1572 CH cows, only 459 (29.2%) had a genomic relationship greater than 0.2 with an NH bull (with 1.3 bulls on average). Many previous studies reported that the existence of a close relationship between test animals and reference animals increased the reliability of genomic predictions for the test animals [2527].

To avoid overestimation of the reliability of GEBV, 19 CH bulls were excluded from the test dataset because they were highly related to 13 bulls in the reference population. In the five-fold cross-validation for the CH cows, two or three half-sib families were randomly assigned to a single-test dataset, instead of randomly choosing individuals. This was done to avoid overestimation of the reliability of GEBV when animals in the test dataset have a large group of half-sibs in the reference dataset. Moreover, genetic trend can increase the correlation between GEBV and DRP if the birth years of the animals in the test dataset cover a wide range. Therefore, in the validation for the CH bulls, the correlation between GEBV and DRP was calculated after correcting for genetic trend. When ignoring this correction, the validation reliabilities of genomic predictions for CH bulls using the joint reference dataset were unrealistically high at 0.69, 0.54 and 0.60 for milk yield, fat yield and protein yield, respectively.

In the current study, when using the joint reference dataset, genomic predictions were estimated using a two-trait model, in which the same biological trait was considered to be a different trait in the CH and NH populations. The reason for using a two-trait model, instead of a single-trait model, was that the DRP had different scales in the two populations due to the use of a standardization procedure in the NH population. The two-trait model also accounts for the presence of any genotype by environment interactions. When genotypes of bulls from three foreign countries were included in the USA domestic predictions, multi-trait methods were not more accurate than the single-trait model for Holstein cattle, but gave higher reliabilities (1.4% higher on average) for Brown Swiss cattle [6]. The authors suggested that this could be due to lower genetic correlations of traits between Brown Swiss populations. Using the two-trait GBLUP model, the estimated genetic correlations between the CH and NH populations were 0.85, 0.70 and 0.75 for milk yield, fat yield and protein yield respectively, which were much lower than the value of the consistency of LD of neighboring markers, which was 0.97 between the two populations. Assuming that the consistency of LD was appropriate to represent the genetic associations between different populations, its clear difference with the estimated genetic correlations suggests the existence of a large genotype by environment interaction between China and Nordic countries.

Conclusions

The consistency of LD is very high between the CH and NH populations, indicating a high level of genetic similarity between the two populations. Genomic prediction for CH cattle can be greatly improved using a joint reference dataset that includes CH and NH cattle. In order to obtain satisfactory reliabilities of genomic predictions for CH cattle, it is necessary to increase the size of the CH reference population or to include foreign Holstein cattle in the reference population.