1 Introduction

The rapid development of sequencing technologies has made it possible to use dense molecular marker information for genomic selection (GS) (Meuwissen et al. 2001), which has been shown to improve prediction accuracy (de los Campos et al. 2010; Crossa et al. 2010; Heslot et al. 2012; Pérez-Rodríguez et al. 2012).

The standard regression model of phenotypes, \(y_i \), (\(i=1,2,{\ldots },n\) individuals) on markers (\(j=1,2,{\ldots },p\) markers) is represented by \(y_i =\mu +\sum _{j=1}^p x_{ij} \beta _j +\varepsilon _i \), or, in matrix notation, as

$$\begin{aligned} {\varvec{y}}=\mu \mathbf{1}+\varvec{X}\varvec{\beta } +\varvec{\varepsilon }, \end{aligned}$$

where \(\mu \) is a general mean, \(\mathbf{1}\) is a vector of ones of order \(n\times 1, \varvec{X}\) is the \(n \times p\) incidence matrix of standardized markers, \(\varvec{\beta } \) is the vector of unknown marker effects, and \(\varvec{\varepsilon }\) is the vector of model errors. Standard assumptions of normality, independency and homoscedasticity are such that \(\varvec{\varepsilon } \sim N\left( \mathbf{0},\sigma ^{2}{\varvec{I}}_n \right) \). Assuming \(\varvec{\beta } \sim N\left( \mathbf{0},\sigma _\beta ^2 \varvec{I} \right) \) and independent of \(\varvec{\varepsilon }\), then (1.1) is usually fitted with the Ridge Regression Best Linear Unbiased Predictor (RRBLUP) method. By letting \(\varvec{u}=\varvec{X}\varvec{\beta }\), (1.1) may be represented as

$$\begin{aligned} \varvec{y}=\mu \mathbf{1}+\varvec{u}+\varvec{\varepsilon }, \end{aligned}$$

where \({\varvec{u}}\) is a vector of random genetic effects and independent of \(\varvec{\varepsilon }\). If \({\varvec{u}}\sim N\left( {\mathbf{0}}, {\varvec{G}}\sigma _u^2 \right) \), and \({\varvec{G}}=\frac{{\varvec{XX}}^{{\prime }}}{p},\) then (1.2) indicates a genomic mixed model which may be estimated through the GBLUP method (VanRaden 2008). Although (1.1) and (1.2) are equivalent, estimation of (1.2) is computationally more convenient (de los Campos et al. 2012). Models (1.1) and (1.2) may be extended in order to account for the vector of fixed effects \(\varvec{\vartheta }\) of factors arranged in an incidence matrix \({\varvec{Z}}\) by adding the term \({\varvec{Z}}\varvec{\vartheta }\) to the right-hand side of (1.1) or of (1.2). However, the data included in this study were already pre-corrected for fixed effects such as experimental design and location effects.

A great deal of research on developing Bayesian linear regression models for capturing the true genetic signal has been published recently; usually these Bayesian models differ in their prior distributions (de los Campos et al. 2012). In general, the parametric regression function has a rigid structure comprising a set of assumptions which may not be met in genomic selection problems. Furthermore, the sample size (n) is usually much smaller than the number of predictors (markers), p (\(p\gg n\)), a problem known as “the curse of dimensionality” (Bellman 1961). Departures from linearity can be addressed by semi-parametric approaches, such as Reproducing Kernel Hilbert Space (RKHS) regressions or different types of neural networks (Gianola et al. 2006, 2011; Gianola and van Kaam 2008; de los Campos et al. 2010; González-Camacho et al. 2012; Pérez-Rodríguez et al. 2012). Gianola et al. (2006, 2014) suggested using RKHS regression for semi-parametric, genomic-enabled prediction, and pointed out that non-parametric methods such as kernel regression are necessary to reduce the dimension of the parametric space, and may be able to capture complex cryptic interaction among markers. However, recovering non-additive interaction among markers is an open field of research and the most successful results have been obtained through kernel-based methods (Howard et al. 2014). Morota and Gianola (2014) and Gianola et al. (2014) pointed out that most studies carried out so far suggest that whole-genome prediction coupled with combinations of kernels may capture non-additive variation.

The basic idea underlying the RKHS approach to GS (Kimeldorf and Wahba 1971; Gianola 2013) is to use the matrix of markers \({\varvec{X}}\) to build a covariance structure among genetic values \({\varvec{u}}\) in (1.2). Therefore, in this context, \({\varvec{u}}\sim N\left( \mathbf{0},\sigma _a^2 {\varvec{K}}_h \right) \) is independent of \(\varvec{\varepsilon } \) (de los Campos et al. 2010), \({\varvec{K}}_h \) is a symmetric positive semi-definite matrix of order \(n \times n\), known as the reproducing kernel (RK) matrix, which depends on the markers, the bandwidth parameter \(h>0\), the additive genetic variance component \(\sigma _a^2 >0\), and \(\varvec{\varepsilon }\) is an \(n \times 1\) vector of homoscedastic and independent normal errors. This general approach requires choosing an RK, for example, a Gaussian kernel function \(K_h \left( {\varvec{x}}_i ,{\varvec{x}}_j \right) =\exp \left( {-hd_{ij}^2 /q_{0.05} } \right) \), where \({\varvec{x}}_i \) and \({\varvec{x}}_j \) are the marker vectors for the ith and jth individuals, respectively, and \(q_{0.05} \) is the fifth percentile of the squared Euclidean distance \(d_{ij}^2 =\sum _k \left( {x_{ik} -x_{jk} } \right) ^{2}\) (González-Camacho et al. 2012).

Let \({\varvec{u}}={\varvec{K}}_h {\varvec{f}}\); then the (1.2) may be represented as

$$\begin{aligned} {\varvec{y}}=\mu \mathbf{1}+{\varvec{K}}_h {\varvec{f}}+{\varvec{\varepsilon }} , \end{aligned}$$

One of the methods used in GS to fit (1.2) and (1.3) is the standard ridge regression (Hoerl and Kennard 1970). Assume that shrinkage parameter \(\gamma \) is a known scalar and that the square of the norm of \({\varvec{u}}\in H\), (where H is the collection of functions in RKHS of real-valued functions) is given by \(\Vert {\varvec{u}}\Vert _H^{\mathbf{2}} ={{\varvec{f}}}' {\varvec{K}}_h {\varvec{f}}\) (Kimeldorf and Wahba 1971); then for each value of \(\gamma \), the ridge estimator is \( {\hat{{\varvec{f}}}} =({\varvec{K}}_h +\gamma {\varvec{I}})^{-1}\left( {{\varvec{y}}-\bar{y} \mathbf{1}} \right) \). Then it follows that \({\hat{{\varvec{u}}}} ={\varvec{K}}_h ({\varvec{K}}_h +\gamma {\varvec{I}})^{-1}\left( {{\varvec{y}}-\bar{y} \mathbf{1}} \right) \). A Bayesian version can be derived by assigning the prior \({\varvec{f}}\sim N\left( {\mathbf{0},\sigma _a^2 {\varvec{K}}_h ^{-1}} \right) \) or, equivalently, \({\varvec{u}}\sim N\left( {\mathbf{0},\sigma _a^2 {\varvec{K}}_h } \right) \), where \(\sigma _a^2 \) and h are unknown hyperparameters and \(\gamma =\sigma ^{2}/\sigma _a^2 \), so that \({\hat{{\varvec{f}}}} \) and \({\hat{{\varvec{u}}}} \) are the modes of the posterior distributions of \({\varvec{f}}\) and \({\varvec{u}}\), respectively.

In the non-parametric regression framework, the key idea is to develop more flexible models that will capture complex signals, thus improving the prediction accuracy of the genetic values. The RK function takes as input the markers used to create a covariance structure for the genetic values \(\mathrm{Cov}\left( {u_i ,u_j } \right) =\sigma _a^2 K_h \left( {{\varvec{x}}_i ,{\varvec{x}}_j } \right) \) (\(\sigma _a^2 >0)\), where \(u_i\), and \(u_j \) are the genetic values of the ith and jth individuals. For example, the linear kernel given by \({\varvec{K}}=\frac{{\varvec{XX}}'}{p}\) [see the definition of \({\varvec{G}}\) in (1.2)] can be used to reduce the dimensionality of the genotypic data and, hence, the number of parameters to be estimated (de los Campos et al. 2010; VanRaden 2008). A comprehensive review of various kernel-based approaches for capturing genetic variation in the context of genomic-enabled prediction was recently published by Morota and Gianola (2014).

The RK function has two components: a distance measure between individuals based on markers and the bandwidth parameter h that controls the rate of decay of the covariance between genotypes. The RK function can either be selected from different RK options or constructed with Mercer’s theorem, as described by Genton (2001). The most commonly used RK function in GS is the Gaussian kernel (Gianola et al. 2006). Other RK functions, such as the t (Tussel et al. 2014) and the exponential (Endelman 2011), have also been used. However, none of them have proven to be consistently superior to the Gaussian RK function (Ober et al. 2011). Assuming that the Gaussian kernel has been selected, a remaining crucial step is to tune the bandwidth parameter, which is usually based on cross-validation or penalization (Härdle 1990).

Assume that the RKHS in (1.3) is non-parametric and involves a fixed operator \({\varvec{K}}_h \) for a known h. Then, given h, we need to estimate \({\varvec{f}}\) or, equivalently, \({\varvec{u}}\) and the variance components. One way of achieving this goal is to use Bayesian inference within the theoretical framework of inverse problems (Aster et al. 2005), such as those proposed by Knapick et al. (2012) and Cavalier (2008), and apply them to genomic-enabled prediction (Cuevas et al. 2014). In an attempt to optimize prediction ability in GS, the bandwidth parameter h is selected by means of cross-validation in a grid of values, as shown in Heslot et al. (2012). However, cross-validation methods for selecting h are computationally very intense and sometimes not easy to apply in GS (Gianola and van Kaam 2008). Endelman (2011) used the restricted maximum likelihood (REML) to jointly fit (1.2) including h for a model with the Gaussian RK function; the proposed method (GK REML) is implemented in the R package rrBLUP. Note that when prior distributions of variance components are assumed to be flat, the REML estimators coincide with the mode of the marginal posterior density (Harville 1974; Blasco 2001). An efficient Bayesian method that minimizes the effect of selecting an inadequate value of h and thus improves GS prediction accuracy was proposed by de los Campos et al. (2010), who used an extension of (1.2) with \({\varvec{u}}={\varvec{u}}_1 +\cdots +{\varvec{u}}_k \), such that each random effect, \({\varvec{u}}_{\mathrm{i}} ,i=1,\ldots ,k\), is weighted by its variance component in a process termed Kernel Averaging (KA) or multi-kernel. Kernel averaging uses a prior distribution that assumes independence among random effects \(p\left( {{\varvec{u}}_1 ,\ldots ,{\varvec{u}}_k |\sigma _{u_1 }^2 ,\ldots ,\sigma _{u_k }^2 } \right) =N\left( {{\varvec{u}}_1 |\mathbf{0},{\varvec{K}}_1 \sigma _{u_1 }^2 } \right) \times \cdots \times N({\varvec{u}}_k |\mathbf{0},{\varvec{K}}_k \sigma _{u_k }^2 ).\) The set of kernels \(\{{\varvec{K}}_1 ,\ldots ,{\varvec{K}}_k \}\) is defined based on a set of values of \(h\in \left\{ {h_1 ,\ldots ,h_k } \right\} \). The set of values of h is selected by first defining a range of values that will avoid small and large values of h. Achieving these criteria is important because small values of h will produce a kernel matrix of ones, whereas large values of h will produce a kernel matrix whose off-diagonal elements will approach zero. These two extreme cases are avoided because they do not contribute information that can be used for prediction; for example, in the case of small values of h, the information provided is redundant with the intercept [which is already included in (1.3)] and in the case of large values of h, will lead to a random effect with a variance covariance structure similar to that of the error term in (1.3). de los Campos et al. (2010) also show that \({\varvec{u}}={\varvec{u}}_1 +\cdots +{\varvec{u}}_k \sim N\left( {\mathbf{0},\sigma _u^2 \bar{K} } \right) \), where \(\bar{{\varvec{K}}} =\sum _l {\varvec{K}}_l \sigma _{u_l }^2 /\sigma _u^2 \) and argue that inferring the weights leads to a kernel \(\bar{{\varvec{K}}} \) that is optimal. The strategy for specifying the values of h is also described in Pérez-Rodríguez and de los Campos (2014).

In this study, we propose a Bayesian method for selecting the bandwidth parameter h following a simple and logical idea put forward by Gianola and van Kaam (2008), that is, to assign a prior \(p\left( h \right) \) and obtain a posterior point estimate of h. This is achieved by transforming (1.3) following the approach of Cuevas et al. (2014), which is similar to that proposed by de los Campos et al. (2010), but transforming both sides of (1.3) as in Cavalier (2008). This paper is organized as follows. In Sect. 2, we define the general framework (prior and posterior distributions) of the proposed Bayesian Transformed Gaussian Kernel (TGK) model; we also describe the likelihood, the prior, and the posterior of the bandwidth parameter. In Sect. 3, we describe the two real datasets analyzed using the TGK model, as well as details of the prediction assessment using a random cross-validation scheme. Section 4 gives the prediction accuracy results for the two datasets; results are discussed in Sect. 5 and conclusions are provided in Sect. 6.

2 Statistical Methods

2.1 Bayesian Selection of the Bandwidth for a Transformed Gaussian Kernel (TGK)

For genomic-enabled prediction, Cuevas et al. (2014) showed the advantages of transforming both sides of the parametric linear regression model by an orthonormal matrix based on a singular value decomposition (SVD) of the matrix of regression variables, \({\varvec{X}}\). Some advantages of this method are (i) it reduces computing time because of the reduction in dimensionality of the vector of regression parameters, (ii) the prior distributions for regression parameters have variances that mimic the decay of the singular values of \({\varvec{X}}\), thus controlling over-fitting, and (iii) the posteriors of the transformed regression parameters are conditionally independent. Following Cuevas et al. (2014), a model based on \({\varvec{K}}_h \) (not directly \({\varvec{X}}\)) is described below.

In (1.3), we replace the linear operator \({\varvec{K}}_h \), with its eigenvalue decomposition \({\varvec{K}}_h ={\varvec{U}}_h {\varvec{S}}_h {\varvec{U}}_h^{\prime } \), where \({\varvec{U}}_h \) is a square orthogonal matrix and \({\varvec{S}}_h \) is the diagonal matrix of eigenvalues, and where sub-indexes express dependency on the value of h; this eigenvalue decomposition is a common strategy also used in parametric methods for genome prediction (Zhou et al. 2013). When multiplying both sides of (1.3) by \({\varvec{U}}_h^{\prime } \), we have

$$\begin{aligned} {\varvec{U}}_h^{\prime } {\varvec{y}}=\mu {\varvec{U}}_h^{\prime } \mathbf{1} +{\varvec{U}}_h^{\prime } \left( {{\varvec{U}}_h {\varvec{S}}_h {\varvec{U}}_h^{\prime } } \right) {\varvec{f}}+{\varvec{U}}_h^{\prime } {\varvec{\varepsilon }} . \end{aligned}$$

Let \({\varvec{d}}={\varvec{d}}\left( {\varvec{y}} \right) ={\varvec{U}}_h^{\prime } {\varvec{y}}, {\varvec{t}}= {\varvec{U}}_h^{\prime } \mathbf{1},{\varvec{b}}={\varvec{U}}_h^{\prime } {\varvec{f}}\), \(\tilde{{\varvec{\varepsilon }} } ={\varvec{U}}_h^{\prime } {\varvec{\varepsilon }} \). Since \({\varvec{U}}_h^{\prime } {\varvec{U}}_h ={\varvec{U}}_h {\varvec{U}}_h^{\prime } ={\varvec{I}}_n \), the model becomes

$$\begin{aligned} {\varvec{d}}=\mu {\varvec{t}}+{\varvec{S}}_h {\varvec{b}}+ \tilde{{\varvec{\varepsilon }}} . \end{aligned}$$

By assuming that \({\varvec{\varepsilon }} \sim N\left( {\mathbf{0},{\varvec{I}}_n \sigma ^{2}} \right) \), it follows that \(\tilde{{\varvec{\varepsilon }} } \sim N\left( {\mathbf{0},{\varvec{I}}_n \sigma ^{2}} \right) \). Then, the joint distribution of transformed data \({\varvec{d}}\), given \(h, {\varvec{b}}\) and \(\sigma ^{2}\), is

$$\begin{aligned} p\left( {{\varvec{d}}|\mu ,{\varvec{b}},\sigma ^{2},h} \right) = \mathop \prod \limits _{i=1}^n N(d_i |\mu t_i +s_i b_i ,\sigma ^{2}). \end{aligned}$$

Note that (2.1) may be simply expressed as

$$\begin{aligned} d_i =\mu t_i +s_i b_i + \tilde{\varepsilon } _{i} ,(i=1,\ldots ,n). \end{aligned}$$

In the theory of inverse problems, this model is known as the sequential spatial model (Cavalier 2008). As will be seen later, this representation makes it easy to draw samples from the joint posterior distribution of the target parameters. Estimation of the target parameters in the original model is done directly through the inverse transformation \({\varvec{f}}={\varvec{U}}_h {\varvec{b}}\) and \({\varvec{u}}={\varvec{U}}_h {\varvec{S}}_h {\varvec{b}}.\)

2.2 Joint Posterior Distribution

If the prior of \({\varvec{f}}\) is \(N\left( {\mathbf{0},\sigma _a^2 {\varvec{K}}_h^{-1} } \right) \), where \(\sigma _a^2 \) is the additive genetic variance, then it can be assumed that \(\sigma _a^2 =\varphi \sigma ^{2}\), with \(\varphi >0\) (note that \(\varphi =1/\gamma )\), such that the prior distribution of \({\varvec{b}}={\varvec{U}}_h^{\prime } {\varvec{f}}\) is \(N\left( {{\varvec{b}}|\mathbf{0},\varphi \sigma ^{2}{\varvec{S}}_h^{-1} } \right) \). We also assume that the scale hyperparameter \(\varphi \) is an unknown quantity such that we should assign a prior distribution to \(\varphi \). Therefore, the joint posterior distribution of \({\varvec{b}}, \mu , \sigma ^{2}, \varphi \), and h in (2.1) is given by

$$\begin{aligned} p\left( {\mu ,{\varvec{b}},\sigma ^{2},h,\varphi |{\varvec{y}}} \right)\propto & {} N\left( {{\varvec{d}}|\mu {\varvec{t}}+{\varvec{S}_{h}{\varvec{b}}}, \sigma ^{2}{\varvec{I}}} \right) p\left( \mu \right) N\left( {{\varvec{b}}|\mathbf{0}, \varphi \sigma ^{2}{\varvec{S}_{h}}^{-1}} \right) \nonumber \\&\chi ^{-2}\left( {\sigma ^{2}|\upsilon _\varepsilon ,\tau _\varepsilon } \right) p\left( h \right) p\left( \varphi \right) , \end{aligned}$$

where the prior for \(\mu \) is uniform and the prior for \(\sigma ^{2}\) is the inverse scaled chi-squared with \(\upsilon _\varepsilon \) degrees of freedom (df) and scale parameter \(\tau _\varepsilon \); the priors for \(\varphi \) and h are described below.

2.3 Bandwidth Estimation

Selection of bandwidth h is a challenging problem. The ideal solution is to consider h as an unknown quantity and obtain its marginal distribution through Bayes’ rule, which implies approximation of the joint posterior distribution of \(\left( {\mu ,{\varvec{b}},\sigma ^{2},\varphi ,h} \right) \) via Markov Chain Monte Carlo (MCMC) or other computational tools. However, any simulation method requires recalculating the kernel at each iteration, which is computationally intensive.

Here, we propose fixing h and \(\varphi \) to the mode (\( \tilde{h},\tilde{\varphi }\)) of their joint posterior distribution \(p(h,\varphi |{\varvec{y}})\), which is given by

$$\begin{aligned} p\left( {h,\varphi |{\varvec{y}}} \right)= & {} \int p\left( {{\varvec{\theta }} ,h,\varphi |{\varvec{y}}} \right) \mathrm{d}{\varvec{\theta }} =\int \frac{L\left( {{\varvec{\theta }} ,h,\varphi |{\varvec{y}}} \right) p\left( {{\varvec{\theta }} , h,\varphi } \right) }{m\left( {\varvec{y}} \right) }\mathrm{d}{\varvec{\theta }}\nonumber \\= & {} \frac{p\left( h \right) p\left( \varphi \right) m\left( {{\varvec{y}}| h,\varphi } \right) }{m\left( {\varvec{y}} \right) }\int \frac{L\left( {{\varvec{\theta }} ,h,\varphi |{\varvec{y}}} \right) p\left( {{\varvec{\theta }} |h,\varphi } \right) }{m \left( {{\varvec{y}}|h,\varphi } \right) }\mathrm{d}{\varvec{\theta }} \propto p\left( h \right) p\left( \varphi \right) m\left( {{\varvec{y}}|h, \varphi } \right) , \end{aligned}$$

where \({\varvec{\theta }} =\left( {\mu ,{\varvec{b}},\sigma ^{2}} \right) ^{\prime }, L\left( {{\varvec{\theta }} ,h,\varphi |{\varvec{d}}} \right) \) is the likelihood function, the joint prior is of the form \(p\left( {{\varvec{\theta }} ,h,\varphi } \right) =p\left( {{\varvec{\theta }} |h,\varphi } \right) p\left( h \right) p\left( \varphi \right) \), and \(m\left( {{\varvec{y}}|h,\varphi } \right) =\int L\left( {{\varvec{\theta }} ,\varphi ,h|\varvec{y}} \right) p\left( {{\varvec{\theta }} |h,\varphi } \right) \mathrm{d}{\varvec{\theta }} \) is the predictive distribution given h and \(\varphi .\)

Since the posterior distribution \(p\left( {h,\varphi |{\varvec{y}}} \right) \) is proportional to marginal likelihood \(m\left( {{\varvec{y}}|h,\varphi } \right) \) times joint prior density \(p\left( h \right) p\left( \varphi \right) \), the first step is to analytically derive \(m\left( {{\varvec{y}}|h,\varphi } \right) \); then, the posterior mode of \(p\left( {h,\varphi |{\varvec{y}}} \right) \) can be estimated. Note that the problem reduces to finding the maximum of a function of only two parameters (\(h,\varphi )\).

2.4 The Marginal Likelihood of the Bandwidth

To obtain the marginal distribution of h from (1.3), we integrate out \({\varvec{\theta }} \) from the likelihood using the prior density as a weighting function (Berger et al. 1999; Maruyama and George 2011) such that the marginal likelihood of \({\varvec{y}}\) given (\(h,\varphi )\) is

$$\begin{aligned} m\left( {{\varvec{y}}|h,\varphi } \right) = \int _0^\infty \int _{\mathbb {R}^{n}} \int _{-\infty }^\infty p({\varvec{d}}|\mu ,{\varvec{b}}, \sigma ^{2},h)p\left( {\mu ,{\varvec{b}},\sigma ^{2}} \right) \mathrm{d}\mu \mathrm{d}{\varvec{b}}\mathrm{d}\sigma ^{2}, \end{aligned}$$

where \(p\left( {\mu ,{\varvec{b}},\sigma ^{2}} \right) \) is the joint prior distribution of \(\left( {\mu ,{\varvec{b}},\sigma ^{2}} \right) \), which are assumed to be independent, that is, \(p\left( {\mu ,{\varvec{b}},\sigma ^{2}} \right) =p\left( \mu \right) p\left( {\varvec{b}} \right) p\left( {\sigma ^{2}} \right) \). Assigning uniform, Gaussian, and inverted scaled Chi-square prior distributions to \(\mu , {\varvec{b}}\) and \(\sigma ^{2}\), respectively, it follows that the marginal likelihood is (see Appendix A):

$$\begin{aligned} m\left( {{\varvec{y}}|h,\varphi } \right) \propto \mathop \prod \limits _{i=1}^n (1+\varphi s_i)^{-\frac{1}{2}} \left[ \tau _{\varepsilon }+\sum _{i=1}^n \frac{\tilde{d}_i^2}{(1+\varphi s_i)}\right] ^{\frac{\nu _{\varepsilon }+n-1}{2}}, \end{aligned}$$

where \(\tilde{d} _i \) is the ith element of the vector \(\tilde{{\varvec{d}}} ={\varvec{U}}_h^{\prime } \left( {{\varvec{y}}-\bar{y} \mathbf{1}} \right) \) and \(s_i \) is the ith eigenvalue of the spectral decomposition of \({\varvec{K}}_h \). Then, \(m\left( {{\varvec{y}}|h,\varphi } \right) \) depends implicitly on h. We propose finding the optimal value of h and \(\varphi \) by maximizing \(p\left( {h,\varphi |{\varvec{y}}} \right) \), which implies Bayesian estimators under 0–1 loss function for h and \(\varphi \).

2.5 Prior Distributions of h and \(\varphi \)

An appropriate prior for h may be a Gamma distribution \(p\left( h \right) \sim Ga\left( {h|\upsilon ,\tau } \right) ,\)where \(\upsilon \) should be greater than 2 to avoid infinite prior variance; by fixing \(\upsilon \) and the prior mean \(\mu _h \), the scale parameter \(\tau \) is given by \(\nu /{\mu _h }\). Based on the KA method of de los Campos et al. (2010), the idea is to find an interval of values of h that would tend, at one extreme, to have the \({\varvec{K}}_h \) matrix as a diagonal matrix (local bandwidth) and, at the other extreme, to have the \({\varvec{K}}_h \) matrix as a matrix of ones (global bandwidth), which would allow an appropriate interval to be computed. For \(\varphi \) we should assign a proper prior with support on the positive real numbers; then, an alternative is the uniform prior on (0, B), where \(B >0\) is large enough to include values with high posterior density.

2.6 Estimation of the Posterior Mode of h and \(\varphi \)

The posterior distribution of \(\left( {h,\varphi } \right) \) is \(p\left( {h,\varphi |{\varvec{y}}} \right) \propto m({\varvec{y}}|h,\varphi )Ga\left( {h|\upsilon ,\tau } \right) p\left( \varphi \right) \) and its mode may be numerically estimated by the values of \(\left( {h,\varphi } \right) \) within an area that produces the highest value of the posterior density. As demonstrated later, the evaluation of \(p\left( {h,\varphi |{\varvec{y}}} \right) \) does not require computation of repetitive matrix inverses or multiplications, such that the joint posterior mode \(\left( {\tilde{h} ,\tilde{\varphi } } \right) \) can be determined by a numerical approximation.

2.7 Gibbs Sampler

It is difficult to directly get samples from the joint posterior distribution in (2.3), as it does not have a closed form. However, it is possible to obtain the closed forms for the conditional distributions of the parameters. This allows using Markov Chain Monte Carlo (MCMC) through the Gibbs Sampler (Gelfand and Smith 1990) algorithm, which samples sequentially from the full conditional distribution until it reaches a stationary process, converging to the joint posterior distribution.

We carried out convergence and diagnostic tests on different datasets. The Gelman–Rubin convergence tests (Gelman and Rubin 1992) for the model were satisfactory. The Raftery–Lewis test (Raftery and Lewis 1992) suggested a small burn-in period and that the number of iterations should be between 10,000 and 20,000 for the datasets used. With the aim of decreasing the potential impact of MCMC errors on prediction accuracy, we performed a total of 60,000 iterations with a burn-in of 10,000 and a thinning of 10, so that 5,000 samples were used for inference. The Gelman–Rubin convergence tests and the Raftery–Lewis tests were performed using the Convergence Diagnosis and Output Analysis (CODA) R package for MCMC (Plummer et al. 2006). The Effective Sample Size (ESS) varied between 1,400 and 2,000 for all parameters.

2.8 Data and Software Repository

An R script (R Core Team 2015) to implement the Gibbs Sampler described above and a brief document in Help and Programs.zip file can be downloaded from the following link http://hdl.handle.net/11529/10234 together with the file Data.zip containing the datasets used in this study. Supplemental Tables S1 and S2 can also be found at this link.

3 Experimental Data

The two data sources used in this study were previously used by several authors (Crossa et al. 2010; Pérez-Rodríguez et al. 2012; González-Camacho et al. 2012) and most recently by Cuevas et al. (2014) for fitting several Bayesian inverse regression methods.

3.1 Wheat DataSet

This dataset included 306 elite wheat lines from CIMMYT’s Global Wheat Program that were used by Pérez-Rodríguez et al. (2012). These lines were genotyped with 1,717 diversity array technology (DArT) markers generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte.com.au). These DArT corresponds to scores for presence (1) and absence (0) for dominant marker (see Wenzl et al. 2004, for further details). Two traits were analyzed: days to heading (DTH) measured in ten different environments and grain yield (GY) measured in seven different environments.

3.2 Maize DataSet

The maize data represented 21 trait-environment combinations for 300 tropical inbred lines genotyped with 55,000 SNPs each (González-Camacho et al. 2012). A first group of traits included female flowering (FFL) or days to silking, male flowering (MFL) or days to anthesis, and the anthesis-silking interval (ASI). Each trait was evaluated under severe drought stress (SS) and in well-watered (WW) environments. This dataset was also used by Crossa et al. (2010) for assessing prediction performance.

In the second group of traits, grain yield (GY) was obtained under severe drought (SS) and in well-watered (WW) environments. Further, GY of the 300 maize lines was also measured in a large number of relatively high yielding (GY-HI) and low yielding environments (GY-LO). In the third group of traits, the 300 maize lines were also evaluated in nine international environments for gray leaf spot (GLS), a disease caused by the fungus Cercospora zeae-maydis. Finally, in the fourth group, the same 300 lines were evaluated in another set of trials for northern corn leaf blight (NCLB), a disease caused by the fungus Exserohilum turcicum.

3.3 Assessing Prediction Accuracy Using Cross-Validation

Once the value of h was estimated, its predictive ability was evaluated using the TGK model, where conditional distributions were computed according to (2.3). Model predictions for each of the maize and wheat datasets were done in each of 50 random partitions, with 90 % of individuals in the training set and 10 % of individuals in the testing set to be predicted. We used the same 50 random partitions as those used by Pérez-Rodríguez et al. (2012), González-Camacho et al. (2012), and Cuevas et al. (2014). We computed both Pearson’s correlation coefficient between the posterior predictive mean and the corresponding observed values and the Predictive Mean Squared Error (PMSE) as measures of prediction ability.

The GK REML method included in the rrBLUP package (Endelman 2011) was fitted and the mean Pearson’s correlation between predicted and observed values and the PMSE was also calculated for the same 50 random partitions. Therefore, Pearson’s correlation between predicted and observed values and the PMSE for the same 50 random partitions were computed for the TGK model, GK REML, and RKHS KA, as reported by Pérez-Rodríguez et al. (2012) for the wheat dataset, and by González-Camacho et al. (2012) for the maize dataset.

3.4 Assessing the Significant Differences Among the Models (Methods)

We separated the traits in the wheat datasets into two subsets, those comprising the DTH traits and those including the YLD traits. The trait-environment combinations of the maize dataset included flowering time (FFL, FML, ASI) and grain yield (GY) in different stress environments (WW, SS, HI and LOW) and those including the disease traits GLS and NCBL measured in different environments. The significant differences among the methods and models were assessed based on the two criteria, correlation between the predicted and observed values and the PMSE. We computed two-tailed hypothesis testing of the three models, TGK, RKHS KA, GK REML (using a Type I error rate of 5 %) to examine significant differences between models based on correlation and PMSE criteria.

3.5 Full Data Analyses

For each dataset (trait-environment combination), we computed the corresponding parameters using the full data. For the TGK model, we used three different priors with the objective of examining the influence of each of them on the predictive ability of the TGK model. We fitted the TGK model using, as priors for h, Gamma distributions with mean 2 (TGK2) and 1 (TGK1), and with a flat prior (TGKF). The inferences for TKG and RKHS KA were based on 30,000 MCMC samples that were obtained after discarding 5,000 samples (burn-in).

For the TGK model, a Gamma prior distribution for h with mean 2 was used. For GK REML, the default values given in the rrBLUP R package for the grid were used. For RKHS KA, the values used were taken directly from Table 2 in Pérez-Rodríguez et al. (2012) for the wheat datasets and from Table 1 in González-Camacho et al. (2012) for the maize datasets (see footnote in Supplemental Tables S1 and S2 of this study).

4 Results

4.1 Full Data Analysis

Supplemental Tables S1 and S2 (see http://hdl.handle.net/11529/10234) contain the results for TGK1, TGK2, and TGKF. In general, it can be seen that R, defined as the ratio [genomic variance]/[genomic + error variance], for methods TGK1, TGK2, and TGKF are very similar; therefore, the prior should have very little impact on the bandwidth parameter h. Estimates of the bandwidth parameter depend on the data and traits. For the wheat data (DTH and GY), the estimates of h varied around 1 and 2 (Supplemental Table S1), whereas for the maize dataset, the estimates of h from the full data varied from 0.27 to 3.9 (Supplemental Table S2).

4.2 Wheat DataSet

The results of the two-tailed hypothesis tests all conducted based on a Type I error rate of 5 % are shown in Table 1 for the mean PMSE and the mean correlation across 50 random partitions for a total of 17 trait-environment combinations. For the PMSE criterion, TGK was significantly more accurate than GK REML in seven trait-environments combinations (DTH2, DTH8-11, YLD1, and YLD3) and significantly more accurate than RKHS KA in 14 cases. Also, GK REML gave significantly higher prediction accuracy than TGK in four cases (DTH4, DTH12, YLD4, and YLD7). RKHS KA was significantly the best in only one case (YLD2).

Based on correlations, TGK was significantly more accurate than RKHS KA and GK REML in seven trait-environment combinations (the same as those detected as significant based on the PMSE criterion, except for YLD1 and DTH2) (Table 1). In only one case (DTH4), was GK REML significantly more accurate, based on correlation, than TGK.

Table 1 Mean Predictive Mean Squared Error (PMSE) and correlation from 50 random cross-validation partitions (and their standard deviation, SD) of TGK, RKHA KA, and GK REML for various trait-environment combinations of the wheat dataset.
Table 2 Mean Predictive Mean Squared Error (PMSE) and correlation and from 50 random cross-validation partitions (and their standard deviation, SD) of TGK, RKHA KA, and GK REML for various trait-environment combinations of the maize dataset.

4.3 Maize DataSet

The mean PMSE and correlations across 50 random partitions and the significant differences between them based on two-tailed hypothesis tests (Type I error rate = 5 %) are given in Table 2 for a total of 21 trait-environment combinations. Fewer significant differences between the methods were found for the maize data compared to the wheat data; for nine trait-environment combinations, no significant differences among any of the three methods were detected.

Based on PMSE, TGK was the most accurate method together with either RKHS KA or GK REML in four cases and was significantly the solely best predictive method only once (GLS7). GK REML was the best method in three cases (GY-LOW, GLS6, and NCBL2); it tied for the highest position with RKHS KA four times, and with TGK three times. GK REML was a significantly better predictor than TGK in seven cases.

Based on the correlation criterion, TGK shared the highest ranking position with RKHS KA five times and was significantly the best predictive method for GLS6 and NCBL2 (Table 2). RKHS KA was the best predictive method 10 times, sharing the top position with other models 6 times, and was the best model overall on four occasions. GK REML showed a pattern similar to that of TGK, that is, it was the most significant predictive method six times; it shared the top ranking with other methods for three trait-environment combinations, and in three cases it was the most significant predictive method overall (GY-WW, GLS3, and GLS8).

5 Discussion

The marginal likelihood is important in selecting h and \(\varphi \) can be easily computed after eliminating the nuisance parameters through integration with the joint prior of these parameters (Berger et al. 1999). In this study, the assumption that \(p\left( \mu \right) \) has a uniform distribution and that \(\sigma _a^2 =\varphi \sigma ^{2}\) allows developing an expression such as the one shown in (2.3). However, expression (2.3) still has a nuisance parameter (\(\varphi )\) that is difficult to eliminate by analytical integration. An alternative solution may be Monte Carlo integration, but this could produce an unstable solution.

5.1 Similarities and Differences Between REML and the Proposed Bayesian Estimation of h

It should be pointed out that although we solved the problem of finding estimates of h and \(\varphi \) by maximizing a function of the data, our approach is not the same as Endelman’s (2011) strategy. While Endelman (2011) uses the iterative REML approach, our approach simply follows the probability rules of finding the posterior density of h and \(\varphi \).

Blasco (2001) explained that given the variance components (\(\sigma ^{2},\sigma _a^2\)) and h, the BLUP of the vector of random effects \(\varvec{b}\) is equal to its conditional posterior mean (mode or median) because the conditional posterior is normal if the prior for \(\mu \) is flat or normal itself. So, BLUP of \(\varvec{b}\) is the posterior mean of \(p({\varvec{b}}|\sigma ^{2},\sigma _a^2 ,h)\) if \(\varvec{b}\) is included in the model as a Gaussian random effect with mean \(\mathbf{0}\) and variance \(\sigma _a^2 {\varvec{S}}^{-1}, \sigma _a^2 =\varphi \sigma ^{2}\) However, \(\sigma ^{2}\) and \(\sigma _a^2 \) are unknown, and a common way to estimate them, for a given h, is through REML which uses the marginal posterior modes as estimators. However, for both REML and Bayesian approaches, the problem of estimating h remains. It is important to recognize that h is not a variance component, though it is the scale parameter in the Gaussian kernel; that is, h is not the variance of a Gaussian random effect on the linear predictor. Therefore, it cannot be affirmed that the posterior mode of h is equal to the REML estimate, even though the corresponding estimates will be similar as both correspond to the same quantity in the same model.

It is worth mentioning that Endelman (2011) followed the data transformation proposed by Kang et al. (2008), such that \({{\varvec{V}}}'{\varvec{y}}=(u_1,u_2,\ldots ,u_n)'\), where \({\varvec{V}}\) and \(\lambda _i +\varphi ^{-1}\) are the \(n\times n\) matrix of eigenvectors and the ith eigenvalue of the spectral decomposition of \({\varvec{SHS}}\), respectively, with \({\varvec{S}}={\varvec{I}}-\mathbf{11}^{\prime }/n\) and \({\varvec{H}}={\varvec{K}}_h +\varphi ^{-1}{\varvec{I}}\). Then, the restricted likelihood (RL) to be optimized for computing the REML estimate of h is given by

$$\begin{aligned} \hbox {RL}=\frac{1}{2}\left[ {\left( {n-1} \right) \log \left( {2\pi } \right) -\log n- \sum _{i=1}^n \log \left( {\lambda _i +\varphi ^{-1}} \right) -Q/\left( {\varphi \sigma ^{2}} \right) } \right] \end{aligned}$$

with \(Q=\sum _{i=1}^n \left( {\frac{u_i^2 }{\lambda _i +\varphi ^{-1}}} \right) \). Note that as expression (2.4), RL also implicitly depends on h. However, the main difference between RL and (2.4) is that RL is a function of \(\sigma ^{2}\), while \(m\left( {{\varvec{y}}|h,\varphi } \right) \) is not. In Endelman’s method, \(\sigma ^{2}\) is assumed to be equal to \(\hat{\sigma } ^{2}=\hat{\sigma } _a^2 /\hat{\varphi } \), where \(\hat{\sigma } _a^2 \) and \( \hat{\varphi }\) are REML estimations, and this value is plugged into RL. On the other hand, in our Bayesian proposal, the function to be maximized is the marginal posterior density of \(h, p\left( {h|{\varvec{y}},\varphi } \right) \propto m\left( {{\varvec{y}}|h,\varphi } \right) p\left( h \right) \), which by definition does not depend on \(\sigma ^{2}\).

5.2 The Prior Distribution of h

The prior distribution of h may dominate the likelihood and its correct selection depends on the researcher’s knowledge of h for a particular trait. This previous knowledge of the values of h for certain traits can be used a priori to avoid using large values of h that may cause over-fitting, as suggested by Härdle (1990). It is important to note that large values of h generate a \({\varvec{K}}_h \) close to the identity matrix, with the obvious consequence of generating confusion in the signal/noise ratio. Furthermore, low values of h may produce \({\varvec{K}}_h \) matrices with all off-diagonal entries close to 1, which causes serious problems when fitting the model.

The correct selection of \(p\left( h \right) \) influences the signal/noise ratio, and a flat prior is suggested when nothing is known about h. With prior information, we suggest using a Gamma distribution as a prior for h, because this distribution is flexible enough to represent the prior knowledge of h. The parameters of the Gamma distribution may be selected by exploring matrix \({\varvec{K}}_h \) with different values of h and previous knowledge of the range of h values.

We used a Gaussian kernel of the form \(K_{ij} \left( {{\varvec{x}}_i ,{\varvec{x}}_j } \right) =\exp \left( {-hd_{ij}^2 /q_{0.05} } \right) \), where \(d_{ij}^2 \) is the squared Euclidean distance (González-Camacho et al. 2012), and \(q_{0.05} \) is the 0.05 quantile of this squared distance. The value \(q_{0.05} \) is used to scale the values of the kernel such that the values of h close to zero will originate a matrix \({\varvec{K}}_h \) near to one and large values of h will tend to produce an identity matrix for \({\varvec{K}}_h \). For example, for markers of the wheat data, Fig. 1 depicts the histograms of the upper off-diagonal elements of the Gaussian kernel for three values of h. Figure 1a shows a high frequency of high values of \({\varvec{K}}_h \) for \(h=0.2\), indicating a global bandwidth; Fig. 1c shows a high frequency of low values of \({\varvec{K}}_h \) for \(h=6\), and Fig. 1b, with \(h=1,\) shows intermediate off-diagonal values of \({\varvec{K}}_h \). It seems reasonable to think that there is a high probability that a Gamma distribution with mean 1 or 2 covers the range of values of h that generates intermediate values for the off-diagonal elements of \({\varvec{K}}_h \).

For the application of TGK in the maize and wheat datasets, we considered a Gamma distribution with \(\nu =3\), and a scale parameter \(\tau =1.5\). The R program that we provide in the Supplemental Material allows a researcher to choose a flat prior or proper Gamma distribution for h with scale parameters that the researcher considers to be appropriate.

Fig. 1
figure 1

ac Histograms of the off-diagonal entries for three kernels in the wheat dataset: a K1 with \(h=0.2;\) b K2 with \(h=1\), and c K3 with \(h=6\).

5.3 Prediction Accuracy of the Wheat DataSet

The wheat data may have genetic complexities that favor semi-parametric regression over parametric regression, as clearly shown by Pérez-Rodríguez et al. (2012); some of these complexities may be due to cryptic gene \(\times \) gene epistatic effects that might be only captured by a non-additive model. In general, TGK had better prediction accuracy more often than RKHS KA and GK REML for DTH and YLD wheat datasets. Supplemental Table S1 shows the estimated values of h and \(\varphi \) for the full data using TGK2 with a Gamma prior with mean 2, and RKHS KA and GK REML with a Gaussian kernel of the form \(K=\hbox {exp}\left( {-\frac{d_{ij}^2 }{h^{2}}} \right) \). Also, Supplemental Table S1 shows that in almost all cases, the residual variance from GK REML is smaller than those from the other methods and therefore produces larger R. On the other hand, RHKS KA tended to have a smaller R than the other methods, whereas TGK had intermediate values of R that were closer to those estimated by GK REML than those estimated by RHKS KA. This is related to the generally better prediction accuracy of TGK2 for the wheat dataset. An exception is trait DTH4 with a very low signal/noise ratio; the posterior of h is dominated by the prior distribution and thus the posterior mode is close to the prior mean (close to 2).

Depicted in Fig. 2a, b is the effect on the posterior of three different prior specifications for h given that \(\varphi = \hat{\varphi }\), which is approximately equal to the REML estimate. Any other proper prior on \(\mathbb {R}^{+}\) could certainly be used, but we used a Gamma prior in this study due mainly to practical considerations and mathematical convenience. The values of h in TGK2 (Supplemental Table S1) are intermediate, with moderate cross terms of \({\varvec{K}}_h \) similar to those observed in Fig. 2b; it can thus be assumed that the selection of the prior is adequate. Figure 2a, b depict the behavior of the posterior distribution for two trait-environment combinations of the wheat dataset (DTH2 and DTH4). For trait-environment combination DTH2, Fig. 2a shows the posterior distribution of h with the Gamma prior distribution with mean 2 (continuous line) and the Gamma prior distribution with mean 1 (dotted line). These results indicate that for the DTH2 wheat dataset, the posterior is not sensitive to the prior selection. Note in Supplemental Table S1 that the \(R^2\) for TGK2 of this trait-environment combination indicates a relatively high signal/noise rate.

Fig. 2
figure 2

Graphs a and b depict the behavior of the posterior of h for wheat data DTH2 and DTH4, respectively. In each graph, three priors of h are used: proper flat prior distribution (continuous line), Gamma distribution with \(\nu =3,\tau =1.5\) (dashed line) and Gamma distribution with \(\nu =3,\tau =3\) (dotted line).

On the other hand, Fig. 2b for trait-environment combination DTH4 shows the posterior distribution of h with the Gamma prior distribution with mean 2 (continuous line) and the Gamma prior distribution with mean 1 (dotted line). These results indicate that the posterior distribution of h is very sensitive to the prior distribution selection because the likelihood function does not give enough information about h and the remaining parameters of the kernel regression model (see R in Supplemental Table S1). In this case, the selection of an appropriate prior distribution of h will depend strongly on the prior knowledge about the value of h for this trait.

5.4 Prediction Accuracy of the Maize DataSet

For the maize datasets, the prediction accuracy of TGK was statistically similar or lower than the prediction accuracy of GK REML; for seven trait-environment combinations, GK REML was statistically better than TGK. Supplemental Table S2 shows the information after fitting TGK2, RKHS KA and GKREML with the full data. The pattern of results is similar to the patterns in the wheat dataset, with the residual variance lower in GK REML and higher in RKHS KA. The three models (TGK2, RKHS KA and GK REML) had very similar prediction accuracies, except in some specific trait-environment combinations. For example, flowering traits FFL-WW and MFL-WW, with the highest values of h (3.8914 and 3.5055), showed poor prediction performance under the TGK model, probably due to the bimodal distribution of the phenotypic data. Flowering traits such as FFL and MLF do not have a very complex genetic architecture and this may be another reason why the TGK2 model did not perform as well as the other models. Also, low values of h (0.2 and 0.4) produced a more global kernel and thus lowered the prediction ability of model TGK. Another case is NCBL1, where the values of h are low (0.2). However, despite the fact that the correlations of TGK2 were not all higher than those of the other two models, the PMSE are generally lower for methods GK REML and RKHS than for model TGK.

6 Conclusions

The proposed method for estimating h is based on a Bayesian formulation in which the selected value of h is a posterior point estimate of h, namely, the posterior mode \(\tilde{h} \) of the marginal distribution \(p(h|{\varvec{y}},\varphi )\). Then, conditional on \(\tilde{h} \), the TGK model is fitted and used for prediction. The selection of h produced good prediction accuracy of unobserved individuals. When the proposed selection of h is combined with model TGK, the computational limitations derived from the high dimensionality are removed, as well as the poor mixing of the MCMC typical of most Bayesian genomic selection procedures.

The Gamma distribution used as a prior for estimating h penalizes the high values of the bandwidth parameter. To elicit the Gamma prior for h, we fixed the form parameter (\(\varphi \)) and either the mean or the mode may be selected such that \(K_h \) does not tend towards the identity matrix or, at the other extreme, to a \(K_h \) matrix with all its entries equal to one. In the application for wheat and maize data, a mean of 2 was used, which represents moderate values of the cross terms in \({\varvec{K}}_h \).

When the proposed model (TGK) was applied to the wheat data, it indicated superiority in prediction accuracy over the RKHS KA and GK REML models. In the maize data, no superiority was achieved in some trait-environment combinations, due to the fact that some traits were bimodal and had different genetic architecture. Significant differences between models were less clear for the maize data; model GK REML had statistically lower PMSE than model TGK in seven cases, and, in general, both GK REML and RKHS KA had lower PMSE than model TGK.