Background

Genomic estimated breeding values (GEBV) are generally predicted by a regression model [1] trained by a set of animals with known phenotypes and genotypes for a dense marker panel that covers the genome [2]. Prediction accuracy of such models depends on several factors, among which size of the set of training animals is most important, which has been addressed in several studies [2],[3] that consistently claim that the biggest limitation for the accuracy of genomic prediction of livestock is the number of animals with both genotype and phenotype data. In most cases, the number of markers is however substantially larger than the number of training samples. This means that genomic prediction typically has a small sample-to-size ratio, which is also known as a n << p problem [1]. One of the major disadvantages is that n << p may lead to a severe over-fitting problem, which may affect the accuracy of the predictions in a validation dataset. Dimension reduction [4],[5] could be an alternative approach to retain the most relevant information of the genotype data [6],[7] in a low-dimensional vector space.

Our study aimed at investigating a more straightforward and feasible approach to alleviate the n < < p problem, which consists of enlarging the training set by using data from multiple populations. However, studies on across-breed genomic prediction using 50 k genotypes have shown that the use of a multi-breed training dataset typically results in a limited or no increase in accuracy compared to using training data from a single breed [8]-[11]. Previous studies have hypothesized that in order to successfully combine training datasets of Holstein-Friesian and Jersey dairy cattle breeds, genotypes on at least 300 000 SNPs (single nucleotide polymorphisms) should be used [12].

Besides insufficient SNP density, another reason that may explain the limited increase in prediction accuracy observed when using multi-population compared to single-population training data could be that the commonly used models cannot deal appropriately with heterogeneous multi-population data. To date, all across-population genomic prediction studies have used linear models. These linear models generally assume that the effect of a SNP in one population is the same in another population. This assumption can be violated due to several reasons. First, the linkage disequilibrium (LD) may differ between populations. Second, it is quite likely that at least some of the segregating QTL (quantitative trait loci) are population-specific. Third, the absolute effect of a QTL may differ between populations because of differences in genetic background. The assumption of linearity may be too rigorous for any of these situations, especially when using the common 50 k SNP chip. In fact, if differences between populations or lines are too large, predictive ability of across-breed genomic prediction with linear models may be lower than that of within-breed genomic prediction [13]. A few studies have proposed to use multi-trait linear models [14]-[16], where trait by line combinations are modelled as separate but correlated traits, to try to accommodate these issues.

As an alternative solution, we propose to use non-linear models by kernel learning [13],[17],[18]. The basic idea is to predict the breeding value of a test animal using a limited number of training animals with similar genotypes that do not necessarily come from a single population. By doing so, the entire heterogeneous data space spanned by genotypes is decomposed into a large number of locally homogeneous sub-areas [19]-[21], regardless of their population of origin. Such a model might be able to extract the useful information across populations. At the very least, the non-linear models by kernel learning are expected to better capture the heterogeneous nature of the data compared to linear models.

The objective of this study was to investigate the accuracy of multi-line genomic prediction using non-linear models by kernel learning and a linear model that modelled trait by line combinations as separate but correlated traits, and to compare the prediction accuracy of these models to that of commonly used linear genomic prediction models presented by Calus et al. [22]. This comparison was performed with a dataset that included three lines of layer hens.

Methods

Linear regression

Linear regression models [23] have been widely used to implement genomic prediction [24]. In concrete terms, the ultimate goal of a regression task is to predict an unseen value y from a vector of observations/features x. In the scenario of genomic prediction, (x, y) corresponds to genotypes (x) and phenotypes (y) of n training individuals. Linear regression uses a linear function to map the observations x to the responsible value y by a vector w as the linear weights on x:

y= w t x,
(1)

where the weight vector w can be estimated using the training data. To best approximate the underlying functional relationship between x and y by Equation (1), ridge regression aims at minimizing the average quadratic loss (L) between the true response value y i and wtx i :

L w = Σ i = 1 n y i - w t x i 2 +γ w 2 y - X w 2 +γ w 2 .
(2)

The vector y refers to a column vector [y1, y2 ,…, y n ]t that contains the phenotypes of all training animals, while the matrix X contains the genotypes of all training animals. The norm of w is the regularization term. Adding it into the objective function alleviates the over-fitting problem, which might be detrimental to prediction performance since the number of genotypes is generally much larger than the number of training samples. Parameter γ refers to the weight given to the regularization term.

Minimization of the loss function L by Equation (2) with regard to w results in the following estimate:

w * = X t X + γ I - 1 X t y.
(3)

If the following matrix lemma [25] is applied:

A - 1 + B t D - 1 B - 1 B t D - 1 =A B t BA B t + D - 1 ,

the solution to w* can be reformulated to:

w * = X t X X t + γ I - 1 y.
(4)

With this estimate, the prediction y* based on the test vector x t becomes:

y * = w * t x t = y t X X t + γ I - 1 X x t .
(5)

These descriptions provide the basis for the development of the non-linear models presented below. For comparison, we included two linear models, i.e. ridge-regression based on principal component analysis (RRPCA) and genome-enabled best linear unbiased prediction (GBLUP) [26]. More detailed descriptions of these models, and the results obtained with these models on this data, are in [22].

Multi-trait genome-enabled best linear unbiased prediction (MTGBLUP)

One of the disadvantages of linear regression is that the underlying data structures might not be well characterized by the linear weights. In genomic prediction, this implies that the estimated effects are not necessarily strictly additive genetic effects [17], and in the context of multi-breed genomic prediction, this may be further interpreted as the true SNP effects not being the same in different breeds or lines. One straightforward approach to allow estimated SNP effects to differ between lines, is to use a multi-trait GBLUP (MTGBLUP) model that allows genetic correlations between the lines to differ from 1 [14]. The data available was not large enough to estimate these correlations; however, additional data was available on non-genotyped animals for each line. Therefore, pairwise genetic correlations between lines were estimated by applying REML (restricted maximum likelihood) [27] with a model that used the inverse of a combined pedigree and genomic relationship matrix [28] that included all three lines. Using this combined relationship matrix, the number of training records ranged from 24 906 to 27 896 across the three lines, while when only genotyped animals were considered, it ranged from 1004 to 1023. Using the estimated variance components, the MTGBLUP model was run using a G-matrix as described in [26], such that only genotyped animals were included in the reference population.

Non-linear kernel regression

Another interpretation of the expectation that the underlying data structures across breeds or lines might not be well characterized by the linear weights is that the inherent mapping function might not be linear. To capture such data features, the common tandem is to adopt a non-linear function (.) {xφ(x)}. The non-linear function results in new representations of genotypes that may be associated with both additive and non-additive effects [17],[29]. Accordingly, Equation (5) can be modified by replacing x by φ(x):

y * = y t Ф X Ф X t + γ I - 1 Ф X φ x t ,
(6)

where Ф(X) contains the transformed genotypes using φ(x). Interestingly, the predictor does not necessarily depend on the mapping function φ(x) but on the inner products between the vectors φ(x) and φ(y), namely φ(x)φ(y)t, as a result of the following terms in (6):

Ф(X)Ф(X)t: the element of the resultant matrix on the i th column and j th row is φ(x i )φ(x j )t,

Ф(X)φ(x t ): the i th element of the resultant vector is φ(x i )φ(x t )t.

This property implies that the design of the kernel function K(x, t) = φ(x)φ(t)t is sufficient to give rise to the predictor without any knowledge on the mapping function φ(x):

y * = y t K + γ I - 1 k,
(7)

where K is a matrix with elements K(x i , x j ), i, j = 1,2,…, n and k is a vector with elements K(x i , x t ), i = 1,2,…, n.

Construction of kernels

One possible interpretation of kernel learning is that the kernel function of two vectors x and t, K(x, t), to some extent describes the similarity between x and t by tending to yield a relatively large value when x is similar to t. There are two typical approaches to evaluate the similarity of two vectors: cross-correlation xtt and distance d(x, t). Both of these are intrinsically related: xtt is inversely proportional to d(x, t) if the measure d is Euclidean distance: d(x, t) = || x - t ||2 = || x ||2 + || t ||2 - 2xtt. Therefore, in this study both cross-correlation-based kernels [13],[30] and distance-based kernels [30]-[33] that use those two similarity measures were used.

Cross-correlation based kernels

The polynomial kernel is the most classical cross-correlation-based kernel [28],[34] that depends on the inner product of two vectors:

K x , t = x t t l .
(8)

This kernel maps the original feature space into one that is spanned by monomials of degree l. A more general definition of the polynomial kernel is:

K x , t = x t t + c l ,
(9)

which is called an inhomogeneous polynomial kernel since a unit shift is added onto the inner product of two vectors. Compared with the homogeneous kernel given by Equation (8), the explicit mapping function of this kernel contains all monomials whose degrees are equivalent to or smaller than l.

Distance-based kernels

Similarity can also be measured by the distance d: if x and t are similar, the function value of d(x, t) should be small. Mathematically speaking, the distance function should satisfy the following three properties:

  1. 1.

    d(x, x) ≥ 0,

  2. 2.

    d(x, t) = d(t, x),

  3. 3.

    d(x, t) < d(x, z) + d(z, t).

Then, a valid kernel can be constructed by the following equation:

K x , t = e - d x , t .
(10)

Distance-based kernels are derived from L p -norm distance, which has been proven to satisfy the aforementioned requirements [34]:

||x || p = x 1 p + x 2 p + + x m p 1 p .

Two well-known distance kernels are special cases of this general equation: the radial basis function (RBF) kernel (p = 2, also known as Gaussian kernel) [31] and the Laplacian Kernel (p = 1) [33]:

K G x , t = e - || x - t || 2 ,
K L x , t = e - || x - t || 1 .

Comparison of methods

In our study, accuracy of genomic prediction based on multi-line training was evaluated for two non-linear models that were based on two different kernels that are the most representative of the two categories of kernels described in the previous section [35]. The first uses the RBF kernel and is termed “RBF” hereafter, and the second uses the polynomial kernel and is termed “Poly” hereafter. Linear regression, also known as ridge regression (RR), is a special case of kernel linear regression that adopts the linear kernel [13]. A method equivalent to RR, i.e. GBLUP that uses a genomic relationship matrix [26], is applied here for comparison.

Considering that the number of SNPs is relatively large compared to the number of animals with phenotypes, all models were also implemented after performing principal component analysis (PCA) to reduce the data dimensions while still explaining 97% of the variance of the SNP genotypes in the data. These three models are termed RRPCA for RR, RBFPCA for RBF kernel based linear regression and PolyPCA for polynomial kernel based linear regression.

Data, pre-analysis, and experimental configurations

To compare the models, data of two brown and one white lines of layer chickens were analysed. The brown layer lines B1 and B2 were closely related to each other, while the white line (W1) was only distantly related to the brown lines. The phenotype data used was the number of eggs in the first production period until the hens reach the age of 24 weeks. Across the three lines, 3753 female birds had both phenotypes and genotypes for 45 974 SNPs from the chicken 60 k Illumina Infinium iSelect Beadchip [36] after editing. More details on the dataset and on the editing of the SNP data are described in Calus et al. [22].

Seven different training sets and one validation set per line were defined to evaluate the accuracy of genomic prediction with single- and multi-line training datasets. For each line, the youngest generation, containing 238 to 240 birds, was used as a validation set. Breeding values for the validation animals were predicted using phenotypes of the training set, which were pre-corrected for hatch week. For the validation animals, the correlation coefficient between the GEBV and their observed phenotypes were computed to evaluate the accuracy of genomic prediction with various training datasets. These correlations are hereafter referred to as ‘predictive correlations’. Commonly, such correlations are divided by the square root of the heritability of the trait to reflect the accuracies of predictions of true breeding values. In this case, we did not do that, because such an adjustment assumes that all the captured genetic variance is additive, while the kernel functions may capture some non-additive effects. Approximate standard errors of the predictive correlations were computed using the expected sampling variance of an estimated correlation ( ρ ^ ), as 1 - ρ ^ 2 N - 2 where N is the number of training animals [24]. The coefficient of the regression of phenotypes on GEBV (b1) was computed to evaluate bias of the predictions. Standard errors of the regression coefficients, denoted as S E b 1 , were derived with bootstrapping, which involved computing regression coefficients for 10 000 bootstrapping samples of the 238 to 240 validation animals, using the R-package “boot” [37]. The regression coefficients were considered as not significantly different from 1 when b 1 - 1 <2×S E b 1 [38].

The first three training sets consisted of one of the three lines. The next three training sets included each of the three pairwise combinations of the three lines. The last training set included layers from all three lines. The resulting training sets included ~1000 to 3000 animals, and the number of segregating SNPs ranged from 30 508 to 45 974 [22].

Results

Genetic correlations between lines

The estimated genetic correlations between the three lines are in Table 1. The genetic correlation between lines B1 and B2 was equal to 0.63, thus significantly larger than 0, which confirms that B1 and B2 are closely related lines. Genetic correlations between lines B1 and W1 and between lines B2 and W1 were equal to -0.26 and -0.55, respectively. The large standard errors of these estimates show that the estimated genetic correlation between line B1 and W1 is not significantly different from 0, while the correlation between B2 and W1 is significantly lower than 0.

Table 1 Estimated genetic correlations between egg production in the three layer lines (standard errors in brackets)

Accuracy of genomic predictions

Tables 2, 3 and 4 show the predictive correlations for each line of six methods using seven training datasets. In the following, we first describe results across the training datasets and then differences between the methods.

Table 2 Performance comparison of seven prediction methods in seven training scenarios for line B1
Table 3 Performance comparison of seven prediction methods in seven training scenarios for line B2
Table 4 Performance comparison of seven prediction methods in seven training scenarios for line W1

Table 2 shows the predictive correlations of line B1 across the training datasets. The impact of multi-line training for line B1 differed slightly between models. Results of the two models with the highest predictive correlations are discussed as examples. The GBLUP model achieved the highest predictive correlation when the model was trained exclusively on data from line B1. In other words, enlarging the training set by adding the training animals from any other line deteriorated the prediction performance. However, the second best model, namely RBF, which had a performance that was slightly inferior to that of the GBLUP model, benefited slightly from enhancing training with data from other lines.

Table 3 contains the predictive correlations for line B2. Compared to the scenario for which the training dataset only contained line B2, both linear models GBLUP and RRPCA had a ~0.03 higher predictive correlation with multi-line prediction. Predictive correlations for the non-linear models were, however, very similar to each other across the training datasets.

Interestingly, focussing on the results for line B1 with training on data from line B2 only, or vice versa, the predictive correlations of the linear and RBF models were clearly superior to those of the Poly models. This suggests that the genotypes of lines B1 and B2 shared some structural similarities that benefitted the predictions of the linear and RBF models. In these situations, the Poly models resulted in predictive correlations that were generally close to 0.

Table 4 shows the predictive correlations for the line W1 validation data. Predictive correlations were very similar across models and training datasets whenever line W1 was included in the training data. When line W1 was not included in the training data, the predictive correlations were always negative, except for MTGBLUP and the Poly models.

Overall, the benefit of multi-line training was limited, and only clearly observed in a few cases when the training data included a closely related line, e.g. lines B1 and B2. Therefore, enlarging the training set with unrelated or distantly related animals did not significantly improve predictive correlations.

Bias of genomic prediction within and across lines

Bias of genomic predictions was assessed by evaluating coefficients of the regression of phenotypes on GEBV. Bias decreases as regression coefficients get closer to 1. For all three lines (See Additional file 1: Tables S1, S2 and S3), bias was more controlled for all models if the evaluated line was included in the training data, otherwise, large biases were observed, especially for the non-linear (Poly and RBF) models. These results indicate that GBLUP, RRPCA, MTGBLUP and RBFPCA gave reasonable results in terms of bias, as long as the evaluated line or a closely related line was included in the training dataset.

Model comparison

Among the non-linear models, the Poly models generally performed worse than the RBF models, both in terms of predictive correlations (Tables 2, 3 and 4) and bias (See Additional file 1: Tables S1, S2 and S3), when the evaluated line was included in the training data. In addition, the predictions of the Poly models had close to 0 predictive correlations and very large biases when based on information from a closely related line (lines B1 and B2).

In the comparison between linear and non-linear models, it is important to note that the non-linear RBF models yielded predictive correlations that were comparable to those of the best linear models (either GBLUP or RRPCA) for lines B1 and W1 when the training data included all lines (Tables 2 and 4). For line B2, RBF performed better than the GBLUP model, while RRPCA had the highest predictive correlation in all scenarios (Table 3). For line B1, however, RRPCA had a lower predictive correlation than the RBF and GBLUP models (Table 2). For lines B1 and B2, the MTGBLUP model generally yielded predictive correlations that were similar to those of most of the other models (Tables 2 and 3). The same was observed for W1 when W1 was included in the training data (Table 4). However, when W1 was not included in the training data, MTGBLUP yielded positive predictive correlations but almost all other models yielded negative predictive correlations.

In summary, the results show that the performance of RBF models was fairly similar to that of the linear models, and that the Poly models generally performed worse. The MTGBLUP model in some situations could generate positive predictive correlations when the trait had a negative correlation between the evaluated line and the line(s) included in the training data.

Complementarity analysis

Because linear and non-linear models focus on different aspects of the genomic data, in this subsection, we analysed the complementarity between models. One way to measure the complementarity between two approaches is based on the correlation between their predictions. Correlations of genomic predictions were computed between models for the training dataset that included all three lines (Table 5). In general, predictions from the Poly models had the lowest correlations with those of other models, which is in line with the observation that, in most cases, the Poly models had the poorest performance in terms of predictive correlation. Ignoring the Poly models, the correlations between predictions from the different models were generally high (>0.9) for line W1. For lines B1 and B2, the predictions from the RBF models had correlations lower than 0.9 with those of GBLUP and RRPCA and higher than 0.9 with those of MTGBLUP. The prediction from the MTGBLUP model deviated substantially from those of GBLUP, with correlations of 0.91 to 0.98. The level of the correlations showed that combining predictions of different models could lead to more accurate predictions. The potential of such an approach was investigated by evaluating combined predictions of two models. A weighted combination of two predictions (â1, â2), can be easily obtained using the following equation:

a ^ =β a ^ 1 + 1 - β a ^ 2 ,0β1,

where parameter β defines the weight given to the two approaches. When β is equal to 0 or 1, the combination is reduced to either of the two predictions. Figure 1 shows the predictive correlations of this combined prediction for the linear models GBLUP and RRPCA and the non-linear model RBF. In Figure 1, each row represents the results for one combination of models and each column represents the results for one of the lines. For line B1, combining predictions from a linear and a non-linear model improved the predictive correlation, especially for the combination of GBLUP and RBF. For line B2, there was little gain by combining models, which is probably due to the superior performance of the RRPCA model. For line W1, the combined prediction was in all cases slightly more accurate. Interestingly, across all situations, the benefit of combining predictions of two models was largest when the two models had a similar predictive correlation.

Table 5 Correlation between genomic predictions obtained from the seven prediction methods
Figure 1
figure 1

Predictive correlations of weighted combinations of genomic predictions with the three models GBLUP, RRPCA, and RBF. In each sub-figure i.e. “GBLUP/RRPCA, B1” means that the prediction is a combination of the predictions of models GBLUP and RRPCA and evaluated on line B1.

Computational complexity

For practical applications of genomic prediction in livestock, it is important that the predictions can be computed efficiently. Therefore, in this section, we analytically evaluate the computational complexity of linear and non-linear models. Revisiting both prediction models, they can be generalized by the following expression:

y * = y t A + γ I - 1 b,

where y is the vector of training phenotypes. For the linear model, A = XXt and b = Xx t (referring to Equation (5)), while for the non-linear model A = K and b = k (referring to Equation (7)). The computation cost depends heavily on the inversion of matrix (A + γ I)-1, which is o(n3) [25]. Parameter n is equal to the dimension of matrix A. The computational complexity of the linear and non-linear models depends on the size of matrix A, which is m × m (i.e. ridge regression BLUP) or n × n (i.e. GBLUP) for the linear models and n × n for the non-linear models implemented in our study, which means that the complexities are either o(m3) or o(n3).

In genomic prediction, the number of genotypes (m) is typically much larger than the number of training animals (n). When ridge regression is used in the linear model (i.e. matrix A is of size m × m) and combined with the use of PCA (i.e. RRPCA in our case), the size of the matrix decreases to less than n × n, because the number of retained principal components will have a maximum value of n-1 [4]. Therefore, computational complexity of the non-linear models implemented in our study is comparable to that of the linear GBLUP model, as summarized in Table 6. Thus, the non-linear models are expected to be able to deal with datasets of similar size as the commonly used GBLUP model.

Table 6 Computational complexity of the implemented linear and non-linear prediction models

Discussion

The objective of this study was to compare the accuracy of multi-line genomic prediction when using non-linear or linear models. In general, when the evaluated line was included in the training data, the non-linear RBF models yielded similar predictive correlations as the linear models. The non-linear models appeared to be slightly less sensitive to the structure of multi-line training datasets. For example, some of the linear models showed small decreases in predictive correlations for lines B1 and W1 when adding other lines [22], but this did not (or rarely) occur for the non-linear models. When only information from a closely related line was used for training, the linear models and the non-linear RBF models had similar performance, indicating that the strong assumptions of the linear models may at least partly hold for the closely related lines used in our study. Our expectation was that the non-linear models would be better able to use relevant information, without making strong assumptions as done in the linear models [21],[39], but the results showed that, overall, the linear models and non-linear RBF models performed similarly.

The complementarity analysis is another aspect of our study. It has been shown that combining genomic predictions of different models, a procedure also known as “bagging” [40], may lead to more robust predictions with generally a higher accuracy [41] or at the very least result in similar accuracies as achieved with the underlying models [42]. In our study, except for line B2, for which RRPCA performed significantly better than any other model, both measures of complementarity indicated that combining linear and non-linear models has the potential to result in slightly more accurate predictions, which means that the linear models capture different features of the data than the non-linear models. The fact that non-linear models captured some predictive variation that was not explained by linear models may be partly due to the ability of non-linear models to capture non-additive effects. Since many non-additive effects are not passed onto the next generation, predictions from non-linear models may be less useful for achieving genetic gain than the linear models. Nevertheless, capturing non-additive effects does help to better predict the performance of an animal itself.

Another focus of this study was to investigate whether the potential benefit of multi-line genomic prediction depends on the genomic similarities of the lines considered. We showed that only some of the lines benefitted from multi-line training, which is consistent with previous studies e.g. [8],[12]. The genotype data of the lines analysed in this work were apparently quite heterogeneous and thus, there was no consistent gain in predictive correlations from using multi-line training data. In some situations, there was a small benefit for lines B1 and B2 but not for W1. This was as expected based on results of the genotype-distance matrix reported by Calus et al. [22], that showed that animals from lines B1 and B2 were more closely related than animals from lines B1 or B2 with animals from line W1. Training data for which relationships with the predicted data are poor, are expected to have negligible contributions to the non-linear predictor. In contrast, the distance between two individuals from lines B1 and B2 was relatively small, indicating that the properties of the genotypes of these two lines were similar. These properties include allele frequencies and LD. Similarities between populations in both of these properties were shown to be closely related to genomic relationships between populations [43]. This might explain the improvement in predictive correlations for lines B1 and B2 in some scenarios when line B1 or B2 was added to the training data. Indeed, the estimated genetic correlations between the lines revealed that the trait investigated was highly correlated between lines B1 and B2. There was, however, no clear improvement in or even deterioration of predictive correlations for lines B1 and B2 when line W1 was included in training, or vice versa. However, across several linear models, positive predictive correlations of 0.10 to 0.14, although not significantly greater than 0, were consistently obtained for line B2 when only line W1 was used for training [22]. Moreover, genetic correlations were equal to -0.26 between lines B1 and W1 and -0.55 between lines B2 and W1, which suggests that information of line W1 was not useful for lines B1 and B2 and vice versa. In summary, a benefit from using multi-line training is especially expected when lines share several common properties, which can be characterized by genomic relationships between lines. Estimating the genetic correlation of the trait between lines may also be very informative. If the distance between the lines is very large and if the estimated correlation is close to 0 or even negative, the benefit of using multi-line genomic prediction is expected to be very limited.

Another interesting conclusion of the comparison between models for the three lines is that no single model was superior over all others for each scenario, which is similar to the results obtained when comparing different linear models [22]. The MTGBLUP model did not necessarily perform better than the other models for lines B1 and B2, but was able to yield substantial positive predictive correlations for line W1 when line B1, B2, or both were used for training. However, when line W1 was used to predict lines B1 and B2, MTGBLUP performed considerably worse than the other linear models. For predicting line B2, RRPCA performed much better than the other models. Interestingly, for line B2, the RBFPCA model was also more advantageous than the other regression models. For predicting line W1, all models performed quite similar whenever line W1 itself was included in the training data.

As an important criterion for model evaluation, the bias of the genomic predictions was evaluated (See Additional file 1: Tables S1, S2 and S3). First, when training and validation data were from the same line, the bias was limited for all models. The genotype distance between a brown hen and a white hen is relatively large such that the kernel value of those two genotypes by Equation (10) becomes small. Therefore, the variance of the GEBV becomes small and the bias accordingly can become very large. In other words, the non-linear models may yield realistic predictive correlations close to 0 combined with very large biases, while the strong assumptions of the linear model appear to control the bias, but at the same time may result in poor predictive correlations. These results highlight the importance of evaluating bias as well as accuracy if the predicted line or breed is not represented in the training data. Conversely, our results show that including the evaluated line in the training data is the best way to control the bias of the predictions, regardless of the model used.

By achieving a significant reduction in the dimension of genotypes, PCA is shown to benefit non-linear models, similar to what has been observed for the linear RRPCA model [22]. Concentrating on the non-linear kernel model that produced the highest predictive correlations, i.e. the RBF kernel, PCA had a minor impact on the predictive correlations, as shown by the correlation between the predictions from RBF and RBFPCA. This might be explained by the nature of the non-linear model: the prediction depends heavily on the distance relationships between training and testing animals, which are not altered by PCA. The Poly models also had very similar predictions whether PCA was performed or not. Regardless, the performance of Poly models was generally worse than that of other models, suggesting that they should not be considered for genomic prediction. Overall, our results with the non-linear RBF and linear RRPCA models suggest that dimensionality reduction of the genotype data might be helpful to decrease computational complexity while hardly affecting model accuracy.

Conclusions

In this study, we investigated genomic prediction with multi-line data. Considering the possible complex heterogeneous data distributions of genotypes in such data, we used non-linear models by kernel linear regression, which rely on the similarity among animals but do not make assumptions on the linearity of genotypes, as the conventional linear models do. On this basis, it was anticipated that the non-linear models would capture different features of multi-line data than the linear models.

Our results indicate that the non-linear RBF models had very similar prediction performance as the generally used linear model GBLUP. Using one line to predict performance in another closely related line, yielded similar prediction accuracies with the RBF and the considered linear models, which suggests that the genotypes of closely related lines share some structural similarities. This was supported by the estimated genetic correlation of 0.63 between the trait in the two closely related lines. Using only data from a distantly related line for prediction with a linear model resulted sometimes in -small positive predictive correlations, in a few cases in considerable negative predictive correlations, and sometimes in predictions with very large bias. This suggests that genomic prediction using only information from a distantly related line or breed should be avoided. Furthermore, despite the similar predictive correlations, linear and non-linear models were shown to capture some complementary predictive information, since the combined prediction slightly improved the predictive correlations.

Additional file