
8.1 The Reproducing Kernel Hilbert Spaces (RKHS)

One of the main goals of genetic research is accurate phenotype prediction. This goal has largely been achieved for Mendelian diseases with a small number of risk variants (Schrodi et al. 2014). However, many traits (like grain yield) have a complex genetic architecture that is not well understood (Golan and Rosset 2014), and phenotype prediction for such traits remains a major challenge. A key difficulty in complex phenotype prediction is accurate modeling of genetic interactions, commonly known as epistatic effects (Cordell 2002). In recent years, there has been mounting evidence that epistatic interactions are widespread throughout biology (Moore and Williams 2009; Lehner 2011; Hemani et al. 2014; Buil et al. 2015). It is well accepted that epistatic interactions are biologically plausible, on the one hand (Zuk et al. 2012), and difficult to detect, on the other hand (Cordell 2009), suggesting that they may be an important contributor to our limited success in modeling complex heritable traits.

Reproducing Kernel Hilbert Spaces (RKHS) regression was one of the earliest statistical machine learning methods suggested for the prediction of complex traits in plant and animal breeding (Gianola et al. 2006; Gianola and van Kaam 2008). An RKHS is a Hilbert space of functions in which all the evaluation functionals are bounded linear functionals. The fundamental idea of RKHS methods is to project the original input data, which live in a finite-dimensional vector space, onto a (possibly infinite-dimensional) Hilbert space. The kernel method consists of transforming the data using a kernel function and then applying conventional statistical machine learning techniques to the transformed data, in the hope of obtaining better results. Methods based on such implicit transformations (RKHS methods) have become very popular for analyzing nonlinear patterns in data sets from many fields of study. Furthermore, kernel functions offer an efficient way to obtain measures of similarity between objects that do not have a natural vector representation. Although the best-known application of kernel methods is the Support Vector Machine (SVM), which is studied in the next chapter, it has been shown that any learning algorithm that can be expressed in terms of distances (or inner products) between objects can be reformulated in terms of kernel functions by applying the so-called "kernel trick." RKHS methods are not limited to regression; they are also powerful for classification and data compression problems and are theoretically sound for dealing with nonlinear phenomena in general. For these reasons, they have found a wide range of practical applications, from bioinformatics to text categorization, from image analysis to web retrieval, from 3D reconstruction to handwriting recognition, and from geostatistics to chemoinformatics. The increase in popularity of kernel-based methods is also due in part to the fact that they provide a rich way to capture nonlinear patterns in data that cannot be captured with conventional linear statistical learning methods.

In genomic selection, the application of RKHS methods continues to increase. For example, Long et al. (2010) found better performance of RKHS methods than linear models for body weight of broiler chickens. Crossa et al. (2010) compared RKHS with the Bayesian Lasso and found that RKHS was better in a wheat data set, whereas both methods performed similarly in a maize data set. Cuevas et al. (2016, 2017, 2018) found superior performance of RKHS methods with Gaussian kernels over linear models in maize and wheat data. Cuevas et al. (2019) also found that when using pedigree, markers, and near-infrared spectroscopy (NIR) data (an inexpensive and nondestructive high-throughput phenotyping technology for predicting unobserved line performance in plant breeding trials), kernel methods (Gaussian and arc-cosine kernels) outperformed linear models in terms of prediction performance. Other authors, however, found minimal differences between RKHS methods and linear models, for example, Tusell et al. (2013) for litter size in swine, Long et al. (2010) and Morota et al. (2013) in progeny tests of dairy sires, and Morota et al. (2014) in phenotypes of dairy cows. Overall, these publications have empirically shown equal or better prediction ability of RKHS methods relative to linear models.
For this reason, the applications of kernel methods in GS are expected to continue increasing, since they can be implemented in current genomic prediction software and because (a) they are very flexible, (b) they are easy to interpret, (c) they are theoretically appealing for accommodating cryptic forms of gene action (Gianola et al. 2006; Gianola and van Kaam 2008), (d) they can be used with almost any type of information (e.g., covariates, strings, images, and graphs) (de los Campos et al. 2010), (e) computation is performed in an n-dimensional space even when the original input information has more columns (p) than observations (n), thus avoiding the p ≫ n problem (de los Campos et al. 2010), (f) they provide a new viewpoint whose full potential is still far from being understood, and (g) they are very attractive due to their computational efficiency, robustness, and stability.

The goal of this chapter is to give the user (student or scientist) a friendly introduction to regression and classification methods based on kernels. We also cover the essentials of kernel methods and, with examples, show the user how to handcraft kernel functions and kernel-based algorithms for applications in the context of genomic selection.

8.2 Generalized Kernel Model

Like any regression problem, a generalized kernel model assumes that we have pairs (yi, xi) for i = 1, …, n, where yi and xi are the response variable and the vector of independent variables (pedigree or marker data) measured on individual i, and the relationship between yi and xi is given by

$$ \mathrm{Distribution}:{y}_i\sim p\left({y}_i|{\mu}_i\right) $$
$$ \mathrm{Linear}\ \mathrm{predictor}:{\eta}_i=f\left({\boldsymbol{x}}_i\right)={\eta}_0+{\boldsymbol{k}}_i^{\mathrm{T}}\boldsymbol{\beta} $$
$$ \mathrm{Link}\ \mathrm{function}:{\eta}_i=g\left({\mu}_i\right) $$

where g(.) is a known link function, μi = h(ηi), h(.) denotes the inverse link function, \( f\left({\boldsymbol{x}}_i\right)={\eta}_0+{\boldsymbol{k}}_i^{\mathrm{T}}\boldsymbol{\beta} \), η0 is an intercept term, ki = [K(xi, x1), …, K(xi, xn)]T, K(., .) is the kernel function, and β = (β1, …, βn)T is an n × 1 vector of regression coefficients. This generalized kernel model provides a unifying framework for kernel-based analyses of continuous, binary, categorical, and count data, since different choices of p(yi| μi) and g(.) yield different models. It is worth pointing out that under the kernel framework, the problem is reduced to finding n regression coefficients instead of the p required by conventional regression models, thus avoiding having to solve a regression problem with p ≫ n. Kernel methods are also very useful when genotypes and phenotypes are connected in ways that are not well addressed by the linear additive models that are standard in quantitative genetics.
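To make the notation concrete, the following minimal R sketch (our own illustration, with a hypothetical Gaussian kernel, simulated markers, and arbitrary coefficient values) builds the vectors ki as the rows of a kernel matrix and evaluates the linear predictor ηi = η0 + kiTβ under an identity link (continuous response) and a logit link (binary response).

# Minimal sketch of the generalized kernel model (illustrative values only)
set.seed(1)
n <- 10; p <- 50
M <- matrix(rnorm(n * p), n, p)          # simulated marker matrix (n individuals, p markers)

# Gaussian kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2); row i of K is k_i^T
gamma <- 1 / p
K <- exp(-gamma * as.matrix(dist(M))^2)

eta0 <- 0.5                              # intercept
beta <- rnorm(n, 0, 0.1)                 # hypothetical n x 1 vector of coefficients
eta  <- eta0 + K %*% beta                # linear predictor eta_i = eta0 + k_i^T beta

mu_gaussian <- eta                       # identity link (continuous response)
mu_binary   <- 1 / (1 + exp(-eta))       # inverse logit link (binary response)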

8.2.1 Parameter Estimation Under the Frequentist Paradigm

Inferring f requires defining a collection (or space) of functions from which an element, \( \hat{f} \), will be chosen via a criterion. Specifically, in RKHS, estimates are obtained by solving the following optimization problem:

$$ \underset{f\in H\ }{\underbrace{\min}}\left\{\frac{1}{n}\sum \limits_{i=1}^nL\left({y}_i,f\left({\boldsymbol{x}}_i\right)\right)+\lambda {\left\Vert f\right\Vert}_H^2\right\}, $$
(8.1)

which means that the optimization is performed within the space of functions H, an RKHS; f ∈ H and ‖f‖H denotes the norm of f in the Hilbert space H. L(yi, f(xi)) is a measure of goodness of fit, that is, a loss function viewed as the negative conditional log-likelihood, which should be chosen in agreement with the type of response variable: for continuous outcomes it is constructed in terms of Gaussian distributions, for binary responses in terms of Bernoulli distributions, for count responses in terms of Poisson or negative binomial distributions, and for categorical responses in terms of multinomial distributions. λ is a smoothing or regularization parameter that should be positive and controls the trade-off between model goodness of fit and complexity, and \( {\left\Vert f\right\Vert}_H^2 \) is the square of the norm of f in H, a measure of model complexity (de los Campos et al. 2010). Hilbert spaces are complete linear spaces endowed with a norm that is the square root of the inner product in the space. The Hilbert spaces relevant for our discussion are RKHS of real-valued functions, here denoted as H; those interested in more technical details of RKHS of real functions should read Wahba (1990). By the representer theorem (Wahba 1990), which tells us that the solutions to some regularization functionals in high- or infinite-dimensional spaces fall in a finite-dimensional space, the solution of (8.1) admits the linear representation

$$ f\left({\boldsymbol{x}}_i\right)={\eta}_0+\sum \limits_{j=1}^n{\beta}_jK\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)={\eta}_0+{\boldsymbol{k}}_i^{\mathrm{T}}\boldsymbol{\beta}, $$
(8.2)

where η0 is an intercept term, K(·, ·) is the kernel function, ki = [K(xi,x1), …, K(xi,xn)]T as defined before, and the βj are regression coefficients. Notice that \( {\left\Vert f\right\Vert}_H^2=\sum \limits_{l,j=1}^n{\beta}_l{\beta}_jK\left({\boldsymbol{x}}_l,{\boldsymbol{x}}_j\right) \), and by substituting (8.2) into (8.1), we obtain the minimization problem under a frequentist framework with respect to η0 and β, as in Gianola et al. (2006) and Zhang et al. (2011):

$$ \underset{\eta_0,\boldsymbol{\beta}\ }{\underbrace{\min}}\left\{\frac{1}{n}\sum \limits_{i=1}^nL\left({y}_i,{\eta}_0+{\boldsymbol{k}}_i^{\mathrm{T}}\boldsymbol{\beta} \right)+\frac{\lambda }{2}{\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{K}\boldsymbol{\beta } \right\}, $$
(8.3)

where K = [k1, …, kn] is the n × n kernel matrix with ki as defined above. Since K needs to be symmetric and positive semi-definite, the term \( {\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{K}\boldsymbol{\beta} \) is an empirical RKHS norm with regard to the training data, λ is a smoothing or regularization parameter that should be positive and controls the trade-off between model goodness of fit and complexity, and the factor \( \frac{1}{2} \) is introduced for convenience. The second term of (8.3) acts as a penalization term added to the minus log-likelihood. The goal is to find η0 and β, which is equivalent to finding the function \( f\left({\boldsymbol{x}}_i\right)={\eta}_0+{\boldsymbol{k}}_i^{\mathrm{T}}\boldsymbol{\beta} \) that minimizes (8.3); f(xi) is a basis expansion of kernel functions, and this representation is due to the representer theorem (Wahba 1990). Therefore, model specification under generalized RKHS methods depends on the choice of the loss function L(·, ·), the Hilbert space H used to build K, and the smoothing parameter λ. The smoothing parameter λ can be chosen by cross-validation or generalized cross-validation under the frequentist framework, or by specifying a prior distribution for the β coefficients under the Bayesian framework (Gianola and van Kaam 2008). It is important to point out that when the response variable is coded as yi ∈ {−1, 1} and the hinge function is used as the loss function, the problem to solve is the standard support vector machine (Vapnik 1998), which is studied in the next chapter.

8.2.2 Kernels

A kernel function converts information on a pair of subjects into a quantitative measure representing their similarity with the requirement that the function must create a symmetric positive semi-definite (psd) matrix when applied to any subset of subjects. The psd requirement ensures a statistical foundation for using the kernel in penalized regression models. From a statistical perspective, the kernel matrix can be viewed as a covariance matrix, and we later show how this aids in the construction of kernels. Kernels are used to nonlinearly transform the input data x1, …, xn ∈ X into a high-dimensional feature space. Next, we provide a definition of kernel function.

Kernel function. A kernel function K is a “similarity” function that corresponds to an inner product in some expanded feature space and that, for all xi, xj ∈ X, satisfies

$$ K\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)=\varphi {\left({\boldsymbol{x}}_i\right)}^{\mathrm{T}}\varphi \left({\boldsymbol{x}}_j\right), $$

where φ is a mapping (transformation) from X to an (inner product) feature space F, φ : x → φ(x). From this definition, we can see that the kernel has the following properties:

  1. It is a symmetric function of its arguments, so that K(xi, xj) = K(xj, xi).

  2. A necessary and sufficient condition for a function K(xi, xj) to be a valid kernel (Shawe-Taylor and Cristianini 2004) is that the Gram matrix, also called the kernel matrix K, whose elements are given by K(xi, xj), should be positive semi-definite for all possible choices of x1, …, xn ∈ X.

  3. Kernels are all those functions K(u, v) that satisfy Mercer’s theorem, that is, for which

    $$ \underset{\boldsymbol{u},\boldsymbol{v}}{\int }K\left(\boldsymbol{u},\boldsymbol{v}\right)g\left(\boldsymbol{u}\right)g\left(\boldsymbol{v}\right)d\boldsymbol{u}d\boldsymbol{v}\ge 0 $$

    for all square-integrable functions g(·).

Mercer’s theorem is an equivalent formulation of the finitely positive semi-definite property for vector spaces. The finitely positive semi-definite property suggests that kernel matrices form the core data structure for kernel methods technology. By manipulating kernel matrices, one can tune the corresponding embedding of the data in the kernel-defined feature space.
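As a quick numeric illustration of the positive semi-definite requirement, the eigenvalues of a Gram matrix built from a valid kernel are all nonnegative (up to rounding error). The following short R check uses a Gaussian kernel on simulated inputs; the data and bandwidth are arbitrary choices for illustration only.

set.seed(2)
X0 <- matrix(rnorm(20), nrow = 5)                       # 5 individuals, 4 inputs
K0 <- exp(-as.matrix(dist(X0))^2)                       # Gaussian kernel Gram matrix
eigen(K0, symmetric = TRUE, only.values = TRUE)$values  # all eigenvalues >= 0, i.e., K0 is psd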

Next, we give an example of the utility of a kernel function and of how it works. Assume that we measured a sample of n plants with two independent variables (x1, x2) and one binary dependent variable (y), that is, (x11, x21, y1), …, (x1n, x2n, yn). The observed data are plotted in Fig. 8.1 (left panel); since the response variable is binary, triangles denote diseased plants and crosses denote non-diseased plants. The goal is to build a classifier for unseen data using the data given in Fig. 8.1 as the training set. It is not possible to separate the two types of plants (diseased vs. non-diseased) with a linear decision boundary, since the true decision boundary is an ellipse in predictor space (Fig. 8.1, left panel). The job of a kernel consists of estimating this boundary by first transforming (mapping) the input information (predictors) via a nonlinear mapping function into a feature space, where the problem reduces to estimating a hyperplane (linear boundary) between the diseased and non-diseased plants. We mapped the input information (Fig. 8.1, left panel) to the feature space using the nonlinear map \( \varphi \left(\mathbf{x}\right)=\left({z}_1={x}_1^2,{z}_2={x}_2^2,{z}_3=\sqrt{2}{x}_1{x}_2\right) \) (Fig. 8.1, right panel); the ellipse becomes a hyperplane that is parallel to the z3 axis, which means that the separation depends only on the (z1, z2) coordinates. Therefore, in the feature space, the problem reduces to estimating a hyperplane from the mapped data points.

Fig. 8.1

Mapping of the two-predictor (x1, x2) problem with a binary dependent variable (y; crosses = 1 and triangles = 0), where the true decision boundary is an ellipse in predictor space, to a feature space via the nonlinear mapping φ(x). Input space (left panel) and feature space (right panel)
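As a hedged companion to Fig. 8.1, the following R sketch simulates plants inside and outside a hypothetical ellipse (our own choice of boundary, x1^2/4 + x2^2 = 1) and applies the map φ(x) = (x1^2, x2^2, √2 x1x2); in the feature coordinates the class label corresponds to the linear rule z1/4 + z2 < 1, i.e., a hyperplane.

# Points inside/outside an ellipse are not linearly separable in (x1, x2),
# but become separable by a linear rule after the feature map
set.seed(4)
n  <- 200
x1 <- runif(n, -3, 3); x2 <- runif(n, -1.5, 1.5)
y  <- as.numeric(x1^2 / 4 + x2^2 < 1)        # 1 = "diseased" (inside the ellipse)

phi <- cbind(z1 = x1^2, z2 = x2^2, z3 = sqrt(2) * x1 * x2)   # feature map

# In feature space the boundary z1/4 + z2 = 1 is a plane parallel to the z3 axis
table(y, linear_rule = as.numeric(phi[, "z1"] / 4 + phi[, "z2"] < 1))  # perfect agreement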

For this reason, in generalized kernel models, the choice of the Hilbert space H is of paramount importance, since it defines the space of functions over which the search for f is performed, and because Hilbert spaces are normed spaces (Akhiezer and Glazman 1963). As mentioned above, by choosing H one automatically defines the reproducing kernel K, which should be at least a psd matrix (de los Campos et al. 2010). Two main properties are required for the successful implementation of a kernel function. First, it should capture as precisely as possible the appropriate measure of similarity for the particular task and domain, and second, its evaluation should require significantly less computation than would be needed for an explicit evaluation of the corresponding feature mapping, φ.

We will call the original input information (X) the input space. With the kernel approach, we then define a function that assigns a real value to each pair of elements (input vectors) of this space X. The space of inputs transformed by the kernel's feature mapping is called the feature space.

8.2.3 Kernel Trick

By kernel trick we mean the use of kernel functions to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data points in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This means that the kernel trick allows us to perform algebraic operations in the transformed data space efficiently and without knowing the transformation φ. For this reason, the kernel trick is a computational trick used to compute inner products in higher dimensional spaces at a low cost. Thus, in principle, any statistical machine learning technique for data in \( X\subset {\mathbb{R}}^n \) that can be formulated in a computational algorithm in terms of dot products can be generalized to the transformed data using the kernel trick. Kernel functions have been introduced for sequence data, graphs, text, and images, as well as vectors.

To better understand the kernel trick, we provide an example. Assume that we measure two independent variables (x1, x2) in four individuals. In matrix notation, the information of the independent variables (input information) is equal to

$$ \boldsymbol{X}=\left[\begin{array}{cc}{x}_{11}& {x}_{12}\\ {}{x}_{21}& {x}_{22}\\ {}{x}_{31}& {x}_{32}\\ {}{x}_{41}& {x}_{42}\end{array}\right] $$

Also, assume we will build a polynomial kernel of degree 2, with

$$ \varphi {\left({\mathbf{x}}_i\right)}^{\mathrm{T}}=\left({z}_1={x}_1^2,{z}_2={x}_2^2,{z}_3=\sqrt{2}{x}_1{x}_2\right). $$

Therefore, for building the Gram matrix (kernel matrix), we need to compute

$$ \boldsymbol{K}=\left[\begin{array}{cccc}\varphi {\left({\mathbf{x}}_1\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_1\right)& \varphi {\left({\mathbf{x}}_1\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_2\right)& \varphi {\left({\mathbf{x}}_1\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_3\right)& \varphi {\left({\mathbf{x}}_1\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_4\right)\\ {}\varphi {\left({\mathbf{x}}_2\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_1\right)& \varphi {\left({\mathbf{x}}_2\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_2\right)& \varphi {\left({\mathbf{x}}_2\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_3\right)& \varphi {\left({\mathbf{x}}_2\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_4\right)\\ {}\varphi {\left({\mathbf{x}}_3\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_1\right)& \varphi {\left({\mathbf{x}}_3\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_2\right)& \varphi {\left({\mathbf{x}}_3\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_3\right)& \varphi {\left({\mathbf{x}}_3\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_4\right)\\ {}\varphi {\left({\mathbf{x}}_4\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_1\right)& \varphi {\left({\mathbf{x}}_4\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_2\right)& \varphi {\left({\mathbf{x}}_4\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_3\right)& \varphi {\left({\mathbf{x}}_4\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_4\right)\end{array}\right] $$

This means that we need to compute each coordinate (cell of K) with φ(xi)Tφ(xj), with i, j = 1, 2, 3, 4. Note that

$$ \varphi {\left({\mathbf{x}}_i\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_j\right)=\left({x}_{i1}^2,{x}_{i2}^2,\sqrt{2}{x}_{i1}{x}_{i2}\right)\left[\begin{array}{c}{x}_{j1}^2\\ {}{x}_{j2}^2\\ {}\sqrt{2}{x}_{j1}{x}_{j2}\end{array}\right]={x}_{i1}^2{x}_{j1}^2+2{x}_{i1}{x}_{i2}{x}_{j1}{x}_{j2}+{x}_{i2}^2{x}_{j2}^2 $$
$$ ={\left({x}_{i1}{x}_{j1}+{x}_{i2}{x}_{j2}\right)}^2. $$

Therefore,

$$ \boldsymbol{K}=\left[\begin{array}{cccc}{\left({x}_{11}^2+{x}_{12}^2\right)}^2& {\left({x}_{11}{x}_{21}+{x}_{12}{x}_{22}\right)}^2& {\left({x}_{11}{x}_{31}+{x}_{12}{x}_{32}\right)}^2& {\left({x}_{11}{x}_{41}+{x}_{12}{x}_{42}\right)}^2\\ {}{\left({x}_{21}{x}_{11}+{x}_{22}{x}_{12}\right)}^2& {\left({x}_{21}^2+{x}_{22}^2\right)}^2& {\left({x}_{21}{x}_{31}+{x}_{22}{x}_{32}\right)}^2& {\left({x}_{21}{x}_{41}+{x}_{22}{x}_{42}\right)}^2\\ {}{\left({x}_{31}{x}_{11}+{x}_{32}{x}_{12}\right)}^2& {\left({x}_{31}{x}_{21}+{x}_{32}{x}_{22}\right)}^2& {\left({x}_{31}^2+{x}_{32}^2\right)}^2& {\left({x}_{31}{x}_{41}+{x}_{32}{x}_{42}\right)}^2\\ {}{\left({x}_{41}{x}_{11}+{x}_{42}{x}_{12}\right)}^2& {\left({x}_{41}{x}_{21}+{x}_{42}{x}_{22}\right)}^2& {\left({x}_{41}{x}_{31}+{x}_{42}{x}_{32}\right)}^2& {\left({x}_{41}^2+{x}_{42}^2\right)}^2\end{array}\right] $$

To compute K we calculated each coordinate using φ(xi)Tφ(xj). However, note that

$$ \varphi {\left({\mathbf{x}}_i\right)}^{\mathrm{T}}\varphi \left({\mathbf{x}}_j\right)={\left({x}_{i1}{x}_{j1}+{x}_{i2}{x}_{j2}\right)}^2={\left({\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{x}}_j+0\right)}^2, $$

where \( {\boldsymbol{x}}_i^{\mathrm{T}}=\left[{x}_{i1},{x}_{i2}\right] \) and \( {\boldsymbol{x}}_j=\left[\begin{array}{c}{x}_{j1}\\ {}{x}_{j2}\end{array}\right] \).

Hence, the function

$$ K\left({\mathbf{x}}_i,{\mathbf{x}}_j\right)={\left({\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{x}}_j+0\right)}^2 $$

corresponds to a polynomial kernel of degree d = 2 and constant a = 0, with F its corresponding feature space. This means that we can compute the inner product between the projections of two points into the feature space without explicitly evaluating their coordinates. In other words, the kernel trick means that we can compute each element (coordinate) of the kernel matrix K without any knowledge of the true nature of φ(xi); we only need to know the kernel function K(xi, xj). Thus, the kernel function is a key ingredient for implementing kernel methods in statistical machine learning.
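This identity can be checked numerically in R with two arbitrary input vectors (chosen only for illustration): the explicit map φ and the kernel (xiTxj)^2 return the same value.

phi <- function(x) c(x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])  # explicit feature map
xi <- c(1.2, -0.7); xj <- c(0.3, 2.0)                        # arbitrary input vectors

sum(phi(xi) * phi(xj))    # inner product in the feature space: 1.0816
(sum(xi * xj))^2          # polynomial kernel of degree 2 with a = 0: 1.0816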

Next, we provide another simple example also using the polynomial kernel of degree 2, with the same two independent variables (x1, x2) but with a constant value a = 1, that is, K(xi, xj) = \( {\left({\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{x}}_j+1\right)}^2 \). According to the kernel trick, this means that we do not need knowledge of φ(xi) to compute all coordinates of the kernel matrix K, since each coordinate takes the value

$$ {\displaystyle \begin{array}{c}K\left({\mathbf{x}}_i,{\mathbf{x}}_j\right)={\left({\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{x}}_j+1\right)}^2={\left({x}_{i1}{x}_{j1}+{x}_{i2}{x}_{j2}+1\right)}^2\\ {}={\left({x}_{i1}{x}_{j1}+{x}_{i2}{x}_{j2}\right)}^2+2\left({x}_{i1}{x}_{j1}+{x}_{i2}{x}_{j2}\right)+1\\ {}={x}_{i1}^2{x}_{j1}^2+2{x}_{i1}{x}_{i2}{x}_{j1}{x}_{j2}+{x}_{i2}^2{x}_{j2}^2+2\left({x}_{i1}{x}_{j1}+{x}_{i2}{x}_{j2}\right)+1.\end{array}} $$

Therefore, the K matrix is

$$ \boldsymbol{K}=\left[\begin{array}{cccc}{\left({x}_{11}^2+{x}_{12}^2+1\right)}^2& {\left({x}_{11}{x}_{21}+{x}_{12}{x}_{22}+1\right)}^2& {\left({x}_{11}{x}_{31}+{x}_{12}{x}_{32}+1\right)}^2& {\left({x}_{11}{x}_{41}+{x}_{12}{x}_{42}+1\right)}^2\\ {}{\left({x}_{21}{x}_{11}+{x}_{22}{x}_{12}+1\right)}^2& {\left({x}_{21}^2+{x}_{22}^2+1\right)}^2& {\left({x}_{21}{x}_{31}+{x}_{22}{x}_{32}+1\right)}^2& {\left({x}_{21}{x}_{41}+{x}_{22}{x}_{42}+1\right)}^2\\ {}{\left({x}_{31}{x}_{11}+{x}_{32}{x}_{12}+1\right)}^2& {\left({x}_{31}{x}_{21}+{x}_{32}{x}_{22}+1\right)}^2& {\left({x}_{31}^2+{x}_{32}^2+1\right)}^2& {\left({x}_{31}{x}_{41}+{x}_{32}{x}_{42}+1\right)}^2\\ {}{\left({x}_{41}{x}_{11}+{x}_{42}{x}_{12}+1\right)}^2& {\left({x}_{41}{x}_{21}+{x}_{42}{x}_{22}+1\right)}^2& {\left({x}_{41}{x}_{31}+{x}_{42}{x}_{32}+1\right)}^2& {\left({x}_{41}^2+{x}_{42}^2+1\right)}^2\end{array}\right] $$

This implies that we computed each coordinate of K without first computing φ(xi). This trick is really useful since, for computing each coordinate of K in this example, we only performed dot products of vectors of size two, the original dimension of the input information, and not dot products of vectors of dimension \( \left(\begin{array}{c}2+2\\ {}2\end{array}\right)=6 \), which is the dimension, in this example, of \( \varphi {\left({\mathbf{x}}_i\right)}^{\mathrm{T}}=\left[{x}_{i1}^2,\sqrt{2}{x}_{i1},\sqrt{2}{x}_{i1}{x}_{i2},\sqrt{2}{x}_{i2},{x}_{i2}^2,1\ \right] \). Therefore, this trick facilitates the computation of K since it requires fewer computational resources. The utility of the trick is better appreciated in a high-dimensional setting. For example, assume that the input information of each individual (xi) contains 784 independent variables; to compute matrix K we then only need, for each coordinate, dot products of vectors of dimension 784 and not of dimension \( \left(\begin{array}{c}784+2\\ {}2\end{array}\right)=\mathrm{308,505} \), which is the dimension of φ(xi)T for the same polynomial kernel with degree 2. For this reason, kernel methods are well suited for handling massive amounts of information, because the computational burden can be proportional to the number of data points rather than to the number of predictor variables (e.g., markers in the context of genomic prediction). This is particularly true if a common weight is assigned to each marker (Morota et al. 2013).

In simple terms, the kernel trick makes it possible to perform a transformation from the input data space to a higher dimensional feature space, where the transformed data can be analyzed with conventional linear models and the problem becomes tractable. However, the result highly depends on the considered transformation. If the kernel function is not appropriate for the problem, or the kernel parameters are badly set, the fitted model can be of poor quality. Due to this, special care must be taken when selecting both the kernel function and the kernel parameters to obtain good results.

The kernel trick allows an efficient search in a higher dimensional space, while the related estimation problems are often cast as convex optimization problems that can be solved by many established algorithms and packages. Kernel methods can be applied to all data analysis algorithms whose inputs can be expressed in terms of dot products. If the data in the original space cannot be analyzed satisfactorily with conventional statistical machine learning techniques, the strategy for extending these techniques to nonlinear models with kernel methods is based on the apparently paradoxical idea of transforming the data, by means of a nonlinear function, into a space of higher dimension than the one in which the data originally lie, and then applying a statistical machine learning algorithm to the transformed data.

Therefore, in general terms, the kernelization of an algorithm consists of its reformulation, so that the determination of a pattern or linear regularity in the data can be carried out exclusively from the information collected in the scalar products calculated for all the pairs of elements in the space. Kernel functions are characterized by the property that all finite kernel matrices are positive semi-definite.

8.2.4 Popular Kernel Functions

Next, we describe the most popular kernel functions used in statistical machine learning.

Linear Kernel

This kernel is defined as \( K\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)={\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{x}}_j \). For example,

$$ K\left(\boldsymbol{x},\boldsymbol{z}\right)=\left({x}_1,{x}_2\right)\left(\begin{array}{c}{z}_1\\ {}{z}_2\end{array}\right)={x}_1{z}_1+{x}_2{z}_2=\varphi {\left(\boldsymbol{x}\right)}^{\mathrm{T}}\varphi \left(\boldsymbol{z}\right). $$

Next, we provide an R function for calculating this kernel that can be used for both single-attribute value vectors and for the whole data set:

K.linear=function(x1, x2=x1) {
  as.matrix(x1)%*%t(as.matrix(x2))
}

Next, we simulate a matrix data set:

set.seed(3)
X=matrix(round(rnorm(16,2,0.2),2),ncol=8)
X

that gives as output:

> set.seed(3)
> X=matrix(round(rnorm(16,2,0.2),2),ncol=8)
> X
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1.81 2.05 2.04 2.02 1.76 1.85 1.86 2.03
[2,] 1.94 1.77 2.01 2.22 2.25 1.77 2.05 1.94

For individual features in pairs of individuals, this function is used as

> K.linear(X[1,1:4],X[2,1:4])
       [,1]   [,2]   [,3]   [,4]
[1,] 3.5114 3.2037 3.6381 4.0182
[2,] 3.9770 3.6285 4.1205 4.5510
[3,] 3.9576 3.6108 4.1004 4.5288
[4,] 3.9188 3.5754 4.0602 4.4844

while for the full set of features, it can be used as

> K.linear(X)
        [,1]    [,2]
[1,] 29.8212 30.7104
[2,] 30.7104 32.0265

This kernel does not overcome the linearity limitation of linear classification and linear regression models in any way, since it leaves the original representation unchanged. It is important to point out that algorithms based on the linear kernel (such as linear regression, linear support vector machines, and linear support vector regression) are special cases of more sophisticated kernel-based algorithms.

Polynomial Kernel

As mentioned above, this kernel is defined as \( K\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)={\left({\upgamma \boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{x}}_j+a\right)}^d \), where γ > 0, a ≥ 0 (a real scalar), and d (a positive integer) are parameters. This kernel family makes it possible to easily control the size of the enhanced representation and the degree of nonlinearity by adjusting the d parameter. A positive a can be used to adjust the relative impact of higher order and lower order terms in the resulting polynomial representation. For example, when γ = 1, a = 0, and d = 2, we have

$$ K\left(\boldsymbol{x},\boldsymbol{z}\right)={\left(\left({x}_1,{x}_2\right)\left(\begin{array}{c}{z}_1\\ {}{z}_2\end{array}\right)\right)}^2={\left({x}_1{z}_1+{x}_2{z}_2\right)}^2= $$
$$ {x}_1^2{z}_1^2+2{x}_1{z}_1{x}_2{z}_2+{x}_2^2{z}_2^2=\left[{x}_1^2,\sqrt{2}{x}_1{x}_2,{x}_2^2\right]\left[\begin{array}{c}{z}_1^2\\ {}\sqrt{2}{z}_1{z}_2\\ {}{z}_2^2\end{array}\right]=\varphi {\left(\boldsymbol{x}\right)}^{\mathrm{T}}\varphi \left(\boldsymbol{z}\right). $$

However, when γ = 1, a = 1, and d = 2, we have

$$ K\left(\boldsymbol{x},\boldsymbol{z}\right)={\left(\left({x}_1,{x}_2\right)\left(\begin{array}{c}{z}_1\\ {}{z}_2\end{array}\right)+1\right)}^2={\left({x}_1{z}_1+{x}_2{z}_2+1\right)}^2= $$
$$ 1+2{x}_1{z}_1+2{x}_2{z}_2+{x}_1^2{z}_1^2+{x}_2^2{z}_2^2+2{x}_1{z}_1{x}_2{z}_2=\left[1,\sqrt{2}{x}_1,\sqrt{2}{x}_2,{x}_1^2,\sqrt{2}{x}_1{x}_2,{x}_2^2\right]\left[\begin{array}{c}\begin{array}{c}1\\ {}\begin{array}{c}\sqrt{2}{z}_1\\ {}\begin{array}{c}\sqrt{2}{z}_2\\ {}{z}_1^2\end{array}\end{array}\end{array}\\ {}\sqrt{2}{z}_1{z}_2\\ {}{z}_2^2\end{array}\right]=\varphi {\left(\boldsymbol{x}\right)}^{\mathrm{T}}\varphi \left(\boldsymbol{z}\right). $$

This demonstrates that increasing a increases the coefficients of lower order terms. The dimension of the feature space for the polynomial kernel is equal to \( \left(\begin{array}{c}p+d\\ {}d\end{array}\right) \). For example, for an input vector of dimension p = 10 and polynomial with degree d = 3, the dimension for this polynomial kernel is equal to \( \left(\begin{array}{c}10+3\\ {}3\end{array}\right)= \)286, while if p = 1000 and d = 3, the dimension for this polynomial kernel is equal to \( \left(\begin{array}{c}1000+3\\ {}3\end{array}\right)= \)167,668,501. Although convenient to control and easy to understand, the polynomial kernel family may be insufficient to adequately represent more complex relationships.
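These feature-space dimensions are binomial coefficients and can be reproduced directly in R:

choose(10 + 3, 3)      # p = 10,   d = 3  ->  286
choose(1000 + 3, 3)    # p = 1000, d = 3  ->  167,668,501
choose(784 + 2, 2)     # p = 784,  d = 2  ->  308,505 (the example of Sect. 8.2.3)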

The R code for calculating this kernel is given next:

K.polynomial=function(x1, x2=x1, gamma=1, b=0, d=3) {
  (gamma*(as.matrix(x1)%*%t(x2))+b)^d
}

Now this function can be used as

> K.polynomial(X[1,1:4],X[2,1:4])
         [,1]     [,2]     [,3]     [,4]
[1,] 43.29532 32.88180 48.15306 64.87758
[2,] 62.90234 47.77288 69.95999 94.25850
[3,] 61.98630 47.07716 68.94117 92.88582
[4,] 60.18099 45.70607 66.93331 90.18058

But for the full set of features, it can be used as

> K.polynomial(X)
         [,1]     [,2]
[1,] 26520.11 28963.86
[2,] 28963.86 32849.48

Sigmoidal Kernel

This kernel is defined as \( K\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)=\tan h\left(\gamma {\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{x}}_j+b\right) \), where tanh is the hyperbolic tangent defined as \( \tan h(z)=\sin h(z)/\cos h(z)=\frac{\exp (z)-\exp \left(-z\right)}{\exp (z)+\exp \left(-z\right)} \). This function is widely used as the activation function for artificial neural networks and deep learning models, and hence has also become popular for kernel methods. If used with properly adjusted parameters, it can represent complex nonlinear relationships. In some parameter settings, it actually becomes similar to the radial kernel (Lin and Lin 2003) described below. However, the sigmoid function may not be positive definite for some parameters, and therefore may not actually represent a valid kernel (Lin and Lin 2003).

Next, we provide an R code for calculating this kernel:

K.sigmoid=function(x1, x2=x1, gamma=0.1, b=0) {
  tanh(gamma*(as.matrix(x1)%*%t(x2))+b)
}

This function is used as

> K.sigmoid(X[1,1:4],X[2,1:4])
          [,1]      [,2]      [,3]      [,4]
[1,] 0.3373862 0.3098414 0.3485656 0.3815051
[2,] 0.3779793 0.3477219 0.3902119 0.4260822
[3,] 0.3763152 0.3461650 0.3885066 0.4242635
[4,] 0.3729798 0.3430454 0.3850881 0.4206158

For the full set of features, it is used as

> K.sigmoid(X)
          [,1]      [,2]
[1,] 0.9948752 0.9957083
[2,] 0.9957083 0.9966999

Gaussian Kernel

This kernel, also known as the radial basis function kernel, depends on the Euclidean distance between the original attribute value vectors (i.e., the Euclidean norm of their difference) rather than on their dot product: \( K\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)={e}^{-\gamma {\left\Vert {\boldsymbol{x}}_i-{\boldsymbol{x}}_j\right\Vert}^2}={e}^{-\gamma \left[{\boldsymbol{x}}_i^T{\boldsymbol{x}}_i-2{\boldsymbol{x}}_i^T{\boldsymbol{x}}_j+{\boldsymbol{x}}_j^T{\boldsymbol{x}}_j\right]} \), where γ is a positive real scalar. It is known that the feature vector φ that corresponds to the Gaussian kernel is actually infinite-dimensional (Lin and Lin 2003). Therefore, without the kernel trick, the solution cannot be computed explicitly. This type of kernel tends to be particularly popular, but it is sensitive to the choice of the γ parameter and may be prone to overfitting.

The R code for calculating this kernel is given next:

l2norm=function(x) { sqrt(sum(x^2)) }
K.radial=function(x1, x2=x1, gamma=1) {
  exp(-gamma*outer(1:nrow(x1 <- as.matrix(x1)),
                   1:ncol(x2 <- t(x2)),
                   Vectorize(function(i, j) l2norm(x1[i,]-x2[,j])^2)))
}

This function is used as

> K.radial(X[1,1:4],X[2,1:4])
          [,1]      [,2]      [,3]      [,4]
[1,] 0.9832420 0.9984013 0.9607894 0.8452693
[2,] 0.9879729 0.9245945 0.9984013 0.9715136
[3,] 0.9900498 0.9296938 0.9991004 0.9681193
[4,] 0.9936204 0.9394131 0.9999000 0.9607894

while for the full set of features, it can be used as

> K.radial(X)
          [,1]      [,2]
[1,] 1.0000000 0.6525288
[2,] 0.6525288 1.0000000

The parameter γ controls the flexibility of the Gaussian kernel in a similar way as the degree d in the polynomial kernel. Large values of γ correspond to large values of d since, for example, they allow classifiers to fit any labels, hence risking overfitting. In such cases, the kernel matrix becomes close to the identity matrix. On the other hand, small values of γ gradually reduce the kernel to a constant function, making it impossible to learn any nontrivial classifier. The feature space has infinite dimensions for every value of γ, but for large values, the weight decays very fast on the higher order features. In other words, although the rank of the kernel matrix is full, for all practical purposes, the points lie in a low-dimensional subspace of the feature space.
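The effect of γ just described can be illustrated with the K.radial() function defined above, applied to the simulated matrix X (a hedged illustration, not output from the book's appendices): with a very large γ the off-diagonal entries collapse toward zero and the kernel matrix approaches the identity, whereas with a very small γ all entries approach one and the kernel becomes nearly constant.

K.radial(X, gamma = 100)     # off-diagonal entries essentially 0: close to the identity matrix
K.radial(X, gamma = 0.001)   # all entries close to 1: nearly constant kernel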

Exponential Kernel

This kernel is defined as \( K\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)={e}^{-\gamma \left\Vert {\boldsymbol{x}}_i-{\boldsymbol{x}}_j\right\Vert } \), which is quite similar to the Gaussian kernel function.

The R code is given below:

K.exponential=function(x1, x2=x1, gamma=1) {
  exp(-gamma*outer(1:nrow(x1 <- as.matrix(x1)),
                   1:ncol(x2 <- t(x2)),
                   Vectorize(function(i, j) l2norm(x1[i,]-x2[,j]))))
}

For individual features in pairs of individuals, it can be used as

> K.exponential(X[1,1:4],X[2,1:4])
          [,1]      [,2]      [,3]      [,4]
[1,] 0.8780954 0.9607894 0.8187308 0.6636503
[2,] 0.8958341 0.7557837 0.9607894 0.8436648
[3,] 0.9048374 0.7633795 0.9704455 0.8352702
[4,] 0.9231163 0.7788008 0.9900498 0.8187308

while for full set of features, it can be used as

> K.exponential(X)
          [,1]      [,2]
[1,] 1.0000000 0.5202864
[2,] 0.5202864 1.0000000

Arc-Cosine Kernel (AK)

For AK, an important component is the angle between two vectors computed from inputs xi, xj as

$$ \kern0.5em {\theta}_{i,j}={\cos}^{-1}\left(\frac{{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{x}}_j}{\left\Vert {\boldsymbol{x}}_i\right\Vert \left\Vert {\boldsymbol{x}}_j\right\Vert}\right), $$

where ‖xi‖ is the norm of observation i. The following kernel is positive semi-definite and related to an ANN with a single hidden layer and the ramp activation function (Cho and Saul 2009).

$$ \mathrm{A}{\mathrm{K}}^1\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)=\frac{1}{\pi}\left\Vert {\boldsymbol{x}}_i\right\Vert \left\Vert {\boldsymbol{x}}_j\right\Vert\ J\ \left({\theta}_{i,j}\right), $$
(8.4)

where π is the pi constant and J(θi,j) = [sin(θi,j) + (π − θi,j) cos (θi,j)]. Equation (8.4) gives a symmetric positive semi-definite matrix (AK1) preserving the norm of the entries such that AK(xi, xi) = ‖xi2 and AK(xi, −xi) = 0 and models nonlinear relationships.

Note that the diagonals of the AK matrix are not homogeneous and express heterogeneous variances of the genetic value u; this is different from the Gaussian kernel matrix, with a diagonal that expresses homogeneous variances. This property could be a theoretical advantage of AK when modeling interrelationships between individuals.

In order to emulate the performance of an ANN with more than one hidden layer (l), Cho and Saul (2009) proposed a recursive relationship that repeats the inner product l times:

$$ {\mathrm{AK}}^{\left(l+1\right)}\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right)=\frac{1}{\pi }{\left[{\mathrm{AK}}^{(l)}\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_i\right){\mathrm{AK}}^{(l)}\left({\boldsymbol{x}}_j,{\boldsymbol{x}}_j\right)\right]}^{\frac{1}{2}}\ J\left({\theta}_{i,j}^{(l)}\right), $$
(8.5)

where \( {\theta}_{i,j}^{(l)}={\cos}^{-1}\left\{{\mathrm{AK}}^{(l)}\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right){\left[{\mathrm{AK}}^{(l)}\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_i\right){\mathrm{AK}}^{(l)}\left({\boldsymbol{x}}_j,{\boldsymbol{x}}_j\right)\right]}^{-\frac{1}{2}}\right\} \).

Thus, AK(l + 1) at level (layer) l + 1 is computed from the previous layer AK(l). Computing a bandwidth (the smoothing parameter that controls variance and bias in the output, e.g., the γ parameter in the Gaussian kernel) is not necessary; the only additional tuning required is choosing the number of hidden layers. Cuevas et al. (2019) described a maximum marginal likelihood method used to select the number of hidden layers (l) for the AK kernel. It is important to point out that this kernel method resembles a deep neural network, since it allows using more than one hidden layer.

The R code for the AK kernel with one hidden layer is given below:

K.AK1_Final <- function(x1, x2) {
  n1 <- nrow(x1)
  n2 <- nrow(x2)
  x1tx2 <- x1%*%t(x2)
  norm1 <- sqrt(apply(x1, 1, function(x) crossprod(x)))
  norm2 <- sqrt(apply(x2, 1, function(x) crossprod(x)))
  costheta <- diag(1/norm1)%*%x1tx2%*%diag(1/norm2)
  costheta[which(abs(costheta) > 1, arr.ind = TRUE)] <- 1
  theta <- acos(costheta)
  normx1x2 <- norm1%*%t(norm2)
  J <- (sin(theta) + (pi - theta)*cos(theta))
  AK1 <- 1/pi*normx1x2*J
  AK1 <- AK1/median(AK1)
  colnames(AK1) <- rownames(x2)
  rownames(AK1) <- rownames(x1)
  return(AK1)
}

For the full set of features, it can be used as

> K.AK1_Final(x1=X,x2=X)
       [,1]     [,2]
[1,] 0.9709 1.000000
[2,] 1.0000 1.042699

Since the K.AK1_Final() kernel function only handles one hidden layer, the next part of the code extends it to more than one hidden layer.

#### Kernel Arc-Cosine with deep=4 ####
diagAK_f <- function(dAK1) {
  AKAK <- dAK1^2
  costheta <- dAK1*AKAK^(-1/2)
  costheta[which(costheta > 1, arr.ind = TRUE)] <- 1
  theta <- acos(costheta)
  AKl <- (1/pi)*(AKAK^(1/2))*(sin(theta) + (pi - theta)*cos(theta))
  AKl <- AKl/median(AKl)
}
AK_L_Final <- function(AK1, dAK1, nl) {
  n1 <- nrow(AK1)
  n2 <- ncol(AK1)
  AKl1 <- AK1
  for (l in 1:nl) {
    AKAK <- tcrossprod(dAK1, diag(AKl1))
    costheta <- AKl1*(AKAK^(-1/2))
    costheta[which(costheta > 1, arr.ind = TRUE)] <- 1
    theta <- acos(costheta)
    AKl <- (1/pi)*(AKAK^(1/2))*(sin(theta) + (pi - theta)*cos(theta))
    dAKl <- diagAK_f(dAK1)
    AKl1 <- AKl
    dAK1 <- dAKl
  }
  AKl <- AKl/median(AKl)
  rownames(AKl) <- rownames(AK1)
  colnames(AKl) <- colnames(AK1)
  return(AKl)
}

Next, we illustrate how to use this kernel function for an AK kernel with four hidden layers:

> AK1=K.AK1_Final(x1=X,x2=X)
> AK_L_Final(AK1=AK1, dAK1=diag(AK1), nl=4)
          [,1]     [,2]
[1,] 0.9649746 1.000000
[2,] 1.0000000 1.036335

Hybrid Kernel

We speak of hybrid kernels when two or more kernels are combined, since complex kernels can be created by simple operations (multiplication, addition, etc.) on simpler kernels. An example of a hybrid kernel can be obtained by multiplying the polynomial kernel and the Gaussian kernel. This kernel is defined as \( {\left({\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{x}}_j+a\right)}^d{e}^{-\gamma \left\Vert {\boldsymbol{x}}_i-{\boldsymbol{x}}_j\right\Vert } \). Other types of kernels can also be combined in the same fashion or with other basic operations, like kernel averaging, which is explained next.
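Using the kernel functions defined above, a hedged sketch of such a product-type hybrid kernel is the element-wise product of the polynomial and exponential kernel matrices (matching the unsquared norm in the definition just given); the name K.hybrid and the parameter values are our own choices for illustration.

# Hybrid kernel: element-wise product of the polynomial and exponential kernels
K.hybrid <- function(x1, x2 = x1, gamma = 1, b = 0, d = 2) {
  K.polynomial(x1, x2, gamma = 1, b = b, d = d) * K.exponential(x1, x2, gamma = gamma)
}
K.hybrid(X)    # products of valid kernels are again valid (psd) kernels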

Kernel Averaging

Averaging is another way to create hybrid kernels, since kernel methods do not preclude the use of several kernels together (de los Campos et al. 2010). To illustrate the construction of these kernels, we assume that we have three kernels K1, K2, and K3 that are distinct from each other. In this approach, the three kernels are “averaged” to form a new kernel \( \boldsymbol{K}={\boldsymbol{K}}_1\frac{\sigma_{K_1}^2}{\sigma_K^2}+{\boldsymbol{K}}_2\frac{\sigma_{K_2}^2}{\sigma_K^2}+{\boldsymbol{K}}_3\frac{\sigma_{K_3}^2}{\sigma_K^2} \), where \( {\sigma}_{K_1}^2,{\sigma}_{K_2}^2,{\sigma}_{K_3}^2 \) are variance components attached to kernels K1, K2, and K3, respectively, and \( {\sigma}_K^2 \) is the sum of the three variances. The ratios of the three variance components are tantamount to the relative contributions of the kernels. For instance, the kernels used can be three Gaussian kernels with different bandwidth parameter values, as employed in Tusell et al. (2013), or one can fit several parametric kernels jointly, e.g., the additive (G), dominance (D), and additive by dominance (G#D) kernels, as in Morota et al. (2014). While there are many possible choices of kernels, the kernel function can be estimated via maximum likelihood by recourse to the Matérn family of covariance functions (e.g., Ober et al. 2011) or by fitting several candidate kernels simultaneously through multiple kernel learning.
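A hedged R sketch of kernel averaging, with three Gaussian kernels of different bandwidths and hypothetical variance components used as weights (in practice the variance components are estimated, e.g., by REML or Bayesian methods):

K1 <- K.radial(X, gamma = 0.1)
K2 <- K.radial(X, gamma = 1)
K3 <- K.radial(X, gamma = 5)

s2 <- c(K1 = 0.5, K2 = 0.3, K3 = 0.2)     # hypothetical variance components
w  <- s2 / sum(s2)                        # relative contribution of each kernel
K.avg <- w["K1"] * K1 + w["K2"] * K2 + w["K3"] * K3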

Hybrid kernels illustrate a general principle: more complex kernels can be created from simpler ones in a number of different ways. Kernels can even be constructed that correspond to infinite-dimensional feature spaces at the cost of only a few extra operations in the kernel evaluations, like the Gaussian kernel, which most often is good enough (Ober et al. 2011). There are many other kernels; however, those mentioned above are the most popular. For example, Morota et al. (2013) evaluated diffusion kernels for discrete inputs with animal and plant data and compared these to the Gaussian kernel. Differences in predictive ability were minimal; this is fortunate because computing diffusion kernels is time-consuming.

The bandwidth parameter can be selected based on (1) a cross-validation procedure, (2) restricted maximum likelihood (Endelman 2011), or (3) an empirical Bayesian method such as the one proposed by Pérez-Elizalde et al. (2015). The optimal value of the bandwidth parameter is expected to change with many factors, such as (a) the distance function, (b) the number of markers, allelic frequencies, and coding of markers, all factors affecting the distribution of observed distances, and (c) the genetic architecture of the trait, a factor affecting the expected prior correlation of genetic values (de los Campos et al. 2010).

As pointed out above, kernel methods only need information of the kernel function K(xi, xj), assuming that this has been defined. For this reason, nonvectorial patterns x such as sequences, trees, and graphs can be handled. That is, kernel functions are not restricted to vectorial inputs: kernels can be designed for objects and structures as diverse as strings, graphs, text documents, sets, and graph nodes. It is important to point out that the kernel trick can be applied in unsupervised methods like cluster analysis, and dimensionality reduction methods like principal component analysis, independent component analysis, etc.

8.2.5 A Two Separate Step Process for Building Kernel Machines

The goal of this section is to emphasize that the process of building kernel machines consists of two general, independent steps. The first consists of calculating the Gram matrix (kernel matrix K) using only the information of the independent variables (inputs); in this step, the user needs to define the type of kernel function to use so as to capture the hypothesized nonlinear patterns in the input data. In the second step, once the kernel is ready, we select the statistical machine learning algorithm that will be used for training the model, using the dependent variable, the kernel built in the first step, and any other available covariates. These two separate steps imply that we can use conventional linear statistical machine learning algorithms with any particular kernel function; the only important consideration when choosing the kernel is that it should be suitable for the data at hand. Once a kernel has been built, its performance can be evaluated with many different statistical machine learning methods, which illustrates that any statistical machine learning method can be combined with any kernel function. It is important to point out that since many machine learning methods are only able to work with linear patterns, the kernel trick allows you to build nonlinear versions of these linear algorithms without the need to modify the original machine learning algorithm. The following sections show how the kernel trick works in some standard statistical machine learning models.
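A hedged end-to-end sketch of these two steps with simulated data is given next: step 1 builds a Gaussian kernel matrix from the inputs only, and step 2 feeds that matrix, as if it were a design matrix, to a conventional ridge regression in glmnet. Note that glmnet penalizes βTβ rather than the RKHS norm βTKβ, so this is an approximation to the kernel Ridge regression estimator derived in the next section, not the exact code of the appendices.

library(glmnet)

set.seed(5)
n <- 100; p <- 300
Xs <- matrix(sample(0:2, n * p, replace = TRUE), n, p)   # simulated marker matrix
ys <- rnorm(n)                                           # simulated phenotype

# Step 1: build the kernel (Gram) matrix using only the input information
gamma <- 1 / p
Ks <- exp(-gamma * as.matrix(dist(Xs))^2)

# Step 2: train a linear learner on the kernel matrix
fit  <- cv.glmnet(x = Ks, y = ys, alpha = 0)             # ridge regression on kernel features
yhat <- predict(fit, newx = Ks, s = "lambda.min")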

8.3 Kernel Methods for Gaussian Response Variables

When the response variable is Gaussian, the loss function in expression (8.3) is the negative log-likelihood of a normal distribution, and expression (8.3) reduces to

$$ \underset{\eta_0,\boldsymbol{\beta}\ }{\underbrace{\min}}\left\{\frac{1}{n}\sum \limits_{i=1}^n{\left({y}_i-{\eta}_0-{\boldsymbol{k}}_i^{\mathrm{T}}\boldsymbol{\beta} \right)}^2+\frac{\lambda }{2}{\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{K}\boldsymbol{\beta } \right\}. $$

In matrix notation, the latter expression can be expressed as

$$ \underset{\eta_0,\boldsymbol{\beta}}{\underbrace{\min}}\left\{\frac{1}{2}{\left({\boldsymbol{y}}^{\ast}-\boldsymbol{K}\boldsymbol{\beta } \right)}^{\mathrm{T}}\left({\boldsymbol{y}}^{\ast}-\boldsymbol{K}\boldsymbol{\beta } \right)+\frac{\lambda }{2}{\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{K}\boldsymbol{\beta } \right\}, $$

where \( {\boldsymbol{y}}^{\ast}=\boldsymbol{y}-\mathbf{1}\overline{y} \), using \( \overline{y} \) as an estimator of the intercept (η0). The first-order conditions to this problem are familiar to us (see Chap. 3) and are

$$ \left[{\boldsymbol{K}}^{\mathrm{T}}\boldsymbol{K}+\lambda \boldsymbol{K}\right]\boldsymbol{\beta} ={\boldsymbol{K}}^{\mathrm{T}}{\boldsymbol{y}}^{\ast } $$

Further, since K = KT and K−1 exists, pre-multiplication by K−1 yields

$$ \boldsymbol{\beta} ={\left[\boldsymbol{K}+\lambda \boldsymbol{I}\right]}^{-\mathbf{1}}{\boldsymbol{y}}^{\ast }, $$

where I is an identity matrix of dimension n × n, and to estimate β, λ must be known. It is important to point out that even in the context of large p and small n, the number of coefficients (β) that need to be estimated is equal to n, which considerably reduces the computational resources needed in the estimation process. This solution for the beta coefficients obtained under Gaussian response variables is known as kernel Ridge regression in statistical machine learning, and was first obtained by Gianola et al. (2006) and Gianola and van Kaam (2008) in the context of a mixed effects model under a Bayesian treatment. The predicted values on the original scale of the response variable can be obtained as

$$ \hat{\boldsymbol{y}}=\mathbf{1}\overline{y}+\boldsymbol{K}\hat{\boldsymbol{\beta}}. $$

For a new observation with vector of inputs (xnew), the predictions are made using the following expression:

$$ {\hat{\boldsymbol{y}}}_{\mathrm{new}}=\overline{y}+\sum \limits_{i=1}^n\hat{\beta_i}K\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_{\mathrm{new}}\right) $$
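A hedged R sketch of these closed-form expressions is given below, with a Gaussian kernel, simulated data, and a fixed λ chosen only for illustration (in practice λ is tuned, e.g., by cross-validation): it computes β̂ = (K + λI)−1y*, the fitted values 1ȳ + Kβ̂, and predictions for new lines from K(xi, xnew).

set.seed(6)
n <- 80; p <- 200
Xtrn <- matrix(rnorm(n * p), n, p)             # training inputs (e.g., markers)
ytrn <- rnorm(n)                               # training phenotype
Xnew <- matrix(rnorm(5 * p), 5, p)             # five new lines to predict

gauss.K <- function(A, B, gamma = 1 / p) {     # Gaussian kernel between rows of A and B
  D2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-gamma * D2)
}

K        <- gauss.K(Xtrn, Xtrn)
lambda   <- 0.5                                # illustration only
ystar    <- ytrn - mean(ytrn)
beta.hat <- solve(K + lambda * diag(n), ystar)             # (K + lambda*I)^(-1) y*

yhat     <- mean(ytrn) + K %*% beta.hat                    # fitted values, original scale
yhat.new <- mean(ytrn) + gauss.K(Xnew, Xtrn) %*% beta.hat  # predictions for the new lines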

Next, we provide some examples of Gaussian response variables using different kernel methods.

Example 1

For continuous response variables. The data comprise family, marker, and phenotypic information of 599 lines that were evaluated for grain yield (GY) in four environments. Marker information consisted of 1447 Diversity Array Technology (DArT) markers, generated by Triticarte Pty. Ltd. (Canberra, Australia). This data set also contains the pedigree relationship matrix and is preloaded in the BGLR package under the name wheat. We named this data set the wheat599 data set. The GY measured in the four environments was used for single-environment analyses with various kernel methods.

The first six observations for trait GY in the four environments (labeled 1, 2, 4, and 5) are given next.

> head(y)
              1           2           4          5
775   1.6716295 -1.72746986 -1.89028479  0.0509159
2166 -0.2527028  0.40952243  0.30938553 -1.7387588
2167  0.3418151 -0.64862633 -0.79955921 -1.0535691
2465  0.7854395  0.09394919  0.57046773  0.5517574
3881  0.9983176 -0.28248062  1.61868192 -0.1142848
3889  2.3360969  0.62647587  0.07353311  0.7195856

Also, next are given the first six observations for five standardized markers

> head(XF[,1:5])
       wPt.0538  wPt.8463 wPt.6348  wPt.9992 wPt.2838
[1,] -1.3598855 0.2672768 0.772228 0.4419075 0.439209
[2,]  0.7341284 0.2672768 0.772228 0.4419075 0.439209
[3,]  0.7341284 0.2672768 0.772228 0.4419075 0.439209
[4,] -1.3598855 0.2672768 0.772228 0.4419075 0.439209
[5,] -1.3598855 0.2672768 0.772228 0.4419075 0.439209
[6,]  0.7341284 0.2672768 0.772228 0.4419075 0.439209

Then, with the code given in Appendix 1, which uses the wheat599 data set, the nine kernels explained above were illustrated. We implemented the kernel Ridge regression method using the glmnet library. The results of the nine kernels for GY in each of the four environments are given next.

Table 8.1 indicates that the best predictions (lowest mean square error) in the four environments were observed under the sigmoid kernel and the worst under the polynomial kernel.

Table 8.1 Prediction performance in terms of the mean square error for GY in the four environments (Env) under nine kernel methods

8.4 Kernel Methods for Binary Response Variables

When the response variable is binary, instead of using the sum of squares loss function that was used before for continuous response variables, we now use the negative log-likelihood of the product of Bernoulli distributions, and the expression that needs to be minimized is given next:

$$ \underset{\eta_0,\boldsymbol{\beta}\ }{\underbrace{\min}}\left\{\frac{1}{n}\sum \limits_{i=1}^n\left(-{y}_i\left({\eta}_0+{\boldsymbol{k}}_i^{\mathrm{T}}\boldsymbol{\beta} \right)+\log \left[1+\exp \left({\eta}_0+{\boldsymbol{k}}_i^{\mathrm{T}}\boldsymbol{\beta} \right)\right]\right)+\frac{\lambda }{2}{\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{K}\boldsymbol{\beta } \right\} $$

Estimation of the parameters η0 and β requires an iterative procedure, and gradient descent methods are used for their estimation, like those explained for logistic regression in Chaps. 3 and 7. Here, for the examples, we will use the glmnet package.
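For instance, with an n × n kernel matrix K and a 0/1 response y already available, a hedged call using glmnet with a ridge penalty (in the same spirit as, but not identical to, the Appendix 2 code) might look like the following; the simulated K and y below are placeholders for illustration only.

library(glmnet)

set.seed(7)
n <- 120
K <- exp(-as.matrix(dist(matrix(rnorm(n * 50), n, 50)))^2 / 50)  # placeholder kernel matrix
y <- rbinom(n, 1, 0.5)                                           # placeholder binary response

fit  <- cv.glmnet(x = K, y = y, family = "binomial", alpha = 0, nfolds = 10)
prob <- predict(fit, newx = K, s = "lambda.min", type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
mean(pred == y)     # proportion of cases correctly classified (PCCC)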

In Table 8.2, we can observe the prediction performance for the binary trait Height of the Data_Toy_EYT.R data set with nine kernels (linear, polynomial, sigmoid, Gaussian, exponential, AK1, AK2, AK3, and AK4). The Data_Toy_EYT data set contains 160 observations, 40 in each of the four environments. The phenotypic information consists of a column for lines, another for environments, and four columns corresponding to traits: two measured on a categorical scale, one continuous, and one binary. The data set also contains the genomic relationship matrix of the 40 lines that were evaluated in each of the four environments. Tenfold cross-validation was implemented, and the worst performance in terms of the proportion of cases correctly classified (PCCC) was observed with the sigmoid kernel and the best with the polynomial and AK4 kernels. The R code for reproducing the results in Table 8.2 is given in Appendix 2.

Table 8.2 Prediction performance in terms of the proportion of cases correctly classified (PCCC) with ten fold cross-validation for the binary trait Height of the Data_Toy_EYT.R data set with nine kernels

8.5 Kernel Methods for Categorical Response Variables

For categorical response variables, the loss function is the negative log-likelihood of the product of multinomial distributions and the expression that needs to be minimized is given next:

$$ \underset{\eta_0,\boldsymbol{\beta}}{\underbrace{\min}}\left\{-\frac{1}{n}\left(\sum \limits_{i=1}^n\sum \limits_{c=1}^C{I}_{\left\{{y}_i=c\right\}}\left({\eta}_{0c}+{\boldsymbol{k}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_c\right)-\sum \limits_{i=1}^n\log \left[\sum \limits_{l=1}^C\exp \left({\eta}_{0l}+{\boldsymbol{k}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_l\right)\right]\right)+\frac{\lambda }{2}\sum \limits_{l=1}^C{\boldsymbol{\beta}}_l^{\mathrm{T}}{\boldsymbol{K}\boldsymbol{\beta}}_l\right\} $$

The estimation process does not have an analytical solution, and gradient descent methods are used for the estimation of the required parameters. The optimization process is done with the same methods described in Chaps. 3 and 7 for categorical response variables. The following illustrative examples were implemented using the glmnet library.

The Data_Toy_EYT.R data set, also used above for illustrating kernels with binary response variables, was used again here. The nine kernels were implemented, but now with the categorical response variable days to heading (DTHD). Again, the worst predictions occurred with the sigmoid kernel, but now the best predictions were achieved with the AK2 kernel (Table 8.3). The code given in Appendix 2 can be used to reproduce these results with two small modifications: (a) replace the response variable y2=Pheno$Height with y2=Pheno$DTHD and (b) in the specification of the model in glmnet, replace family='binomial' with family='multinomial'.

Table 8.3 Prediction performance in terms of the proportion of cases correctly classified (PCCC) with ten fold cross-validation for the categorical trait days to heading (DTHD) of the Data_Toy_EYT.R data set with nine kernels

8.6 The Linear Mixed Model with Kernels

Under a linear mixed model (LMM), y = Cθ + Kβ + e, every individual i is associated with a genotype vector \( {\boldsymbol{x}}_i^{\mathrm{T}} \) and a covariate vector \( {\boldsymbol{c}}_i^{\mathrm{T}} \) (e.g., gender, age, herd, race, environment, etc.). Given a sample of individuals with genotype matrix \( \boldsymbol{X}={\left[{\boldsymbol{x}}_1^{\mathrm{T}},{\boldsymbol{x}}_2^{\mathrm{T}},\dots, {\boldsymbol{x}}_n^{\mathrm{T}}\right]}^{\mathrm{T}} \) and incidence matrix of nuisance variables \( \boldsymbol{C}={\left[{\boldsymbol{c}}_1^{\mathrm{T}},{\boldsymbol{c}}_2^{\mathrm{T}},\dots, {\boldsymbol{c}}_n^{\mathrm{T}}\right]}^{\mathrm{T}} \), relating the effects (θ) to the phenotypes, the phenotype vector y = [y1, y2, …, yn]T follows a multivariate normal distribution

$$ \left.\boldsymbol{y}\right|\boldsymbol{X},\boldsymbol{C}\sim \boldsymbol{N}\left(\boldsymbol{C}\boldsymbol{\theta },\ \boldsymbol{K}{\sigma}_{\beta}^2+\boldsymbol{I}{\sigma}_e^2\right) $$

Here, K is a valid kernel encoding genotypic covariance, as long as it is positive semi-definite and, again, represents similarities between genotyped individuals. The nonparametric function is f(X) = Kβ, and the nonparametric coefficients β and residuals can be assumed to be independently distributed as \( \boldsymbol{\beta} \sim \boldsymbol{N}\left(\mathbf{0},{\boldsymbol{K}}^{-\mathbf{1}}{\sigma}_{\beta}^2\right) \) and \( \boldsymbol{e}\sim \boldsymbol{N}\left(\mathbf{0},\boldsymbol{I}{\sigma}_e^2\right) \). θ is a vector of covariate coefficients (the fixed effects), I is the n × n identity matrix, and \( {\sigma}_e^2 \) is the variance of the microenvironmental effects. Under this LMM approach, the function to be minimized becomes

$$ \underset{\boldsymbol{\theta}, \boldsymbol{\beta}}{\min}\;J\left[\boldsymbol{\theta}, \left.\boldsymbol{\beta} \right|\lambda \right]=\underset{\boldsymbol{\theta}, \boldsymbol{\beta}}{\min}\left\{\frac{1}{2{\sigma}_e^2}{\left[\boldsymbol{y}-\boldsymbol{C}\boldsymbol{\theta } -\boldsymbol{K}\boldsymbol{\beta } \right]}^{\mathrm{T}}\left[\boldsymbol{y}-\boldsymbol{C}\boldsymbol{\theta } -\boldsymbol{K}\boldsymbol{\beta } \right]+\frac{\lambda }{2}{\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{K}\boldsymbol{\beta } \right\}. $$

After setting the gradient of J(.) with respect to θ and β simultaneously to zero (Mallick et al. 2005; Gianola et al. 2006; Gianola and van Kaam 2008), the RKHS regression estimating equations can be formulated in matrix form given \( {\sigma}_e^2 \) and λ as

$$ \left[\begin{array}{cc}{\boldsymbol{C}}^{\mathrm{T}}\boldsymbol{C}& {\boldsymbol{C}}^{\mathrm{T}}\boldsymbol{K}\\ {}{\boldsymbol{K}}^{\mathrm{T}}\boldsymbol{C}& {\boldsymbol{K}}^{\mathrm{T}}\boldsymbol{K}+\lambda \boldsymbol{K}{\sigma}_e^2\end{array}\right]\left[\begin{array}{c}\hat{\boldsymbol{\theta}}\\ {}\hat{\boldsymbol{\beta}}\end{array}\right]=\left[\begin{array}{c}{\boldsymbol{C}}^{\mathrm{T}}\boldsymbol{y}\\ {}{\boldsymbol{K}}^{\mathrm{T}}\boldsymbol{y}\end{array}\right] $$
(8.6)

Recall that K is symmetric, so KTK = K2. Premultiplying the second block of equations in (8.6) by K−1 (assuming the inverse exists), we obtain

$$ \left[\begin{array}{cc}{\boldsymbol{C}}^{\mathrm{T}}\boldsymbol{C}& {\boldsymbol{C}}^{\mathrm{T}}\boldsymbol{K}\\ {}\boldsymbol{C}& \boldsymbol{K}+\lambda \boldsymbol{I}{\sigma}_e^2\end{array}\right]\left[\begin{array}{c}\hat{\boldsymbol{\theta}}\\ {}\hat{\boldsymbol{\beta}}\end{array}\right]=\left[\begin{array}{c}{\boldsymbol{C}}^{\mathrm{T}}\boldsymbol{y}\\ {}\boldsymbol{y}\end{array}\right] $$
(8.7)

Working with (8.7) avoids forming KTK, and no explicit inversion of K is required to solve the system. Note that the variance of the nonparametric coefficients, \( {\sigma}_{\boldsymbol{\beta}}^2={\lambda}^{-1} \), may be interpreted as variation due to marked (marker-captured) additive genomic effects.
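To make the algebra concrete, the following minimal sketch solves (8.7) directly in R for given values of λ and \( {\sigma}_e^2 \) (the objects C, K, y, lambda, and sig2e are assumed to exist; this is not the code used to produce the tables):

# Solving the RKHS estimating equations (8.7) for theta and beta
n = nrow(K)
LHS = rbind(cbind(t(C) %*% C, t(C) %*% K),
            cbind(C,          K + lambda * sig2e * diag(n)))
RHS = c(t(C) %*% y, y)
sol = solve(LHS, RHS)
theta_hat = sol[1:ncol(C)]
beta_hat  = sol[-(1:ncol(C))]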

The mixed model y = Cθ + Kβ + e (reparametrization I) can be reparametrized as y = Cθ + u + e (reparametrization II), where u = Kβ, but with u distributed as \( \boldsymbol{u}\sim \boldsymbol{N}\left(\mathbf{0},\boldsymbol{K}{\sigma}_u^2\right) \), and \( {\sigma}_u^2 \) is the additive variance due to lines. Both parametrizations are equivalent and produce the same solution, with the following peculiarities. Parametrization I has two main advantages. (1) The kernel matrix K does not need to be inverted; inverting K may be time-consuming or unfeasible when the number of genotyped individuals is large, because the matrix is dense, and current applications may require inverting matrices of order up to 100,000 × 100,000. (2) Genome-enabled prediction of breeding values for any t new genotyped individuals without phenotypes (\( {\hat{\boldsymbol{u}}}_{\mathrm{new}} \)) can be done with a simple matrix–vector product \( {\hat{\boldsymbol{u}}}_{\mathrm{new}}={\boldsymbol{K}}_s\hat{\boldsymbol{\beta}} \), where \( \hat{\boldsymbol{\beta}} \) contains the n nonparametric coefficients estimated from the n individuals in the training set and Ks is a matrix of dimension t × n containing the genomic similarity values between the t new individuals whose direct genomic merits are to be predicted and the individuals in the training set. When K is the genomic relationship matrix calculated as suggested by VanRaden (2008), the kernel is linear, but when K is calculated with nonlinear kernels such as the Gaussian, exponential, polynomial, arc-cosine, or sigmoid, the same model can better capture nonlinear patterns. With the mixed model equations given above, it is therefore possible to implement any of the proposed kernels, since the only difference between them is the transformation performed on the input information to obtain a particular kernel. In other words, the genomic relationship matrix used in the method known as genomic BLUP (GBLUP) is replaced by a more general kernel matrix that creates similarities among individuals, even if they are genetically unrelated. However, for a particular data set, some kernels will perform better and others worse, since the performance of the kernels is data-dependent. In our context, a kernel is any smooth function K defining a covariance structure among individuals. Also, as mentioned above, we can combine several kernels using a weighted sum or a product to create new kernels. In general, as mentioned many times in this chapter, there is enough empirical evidence that kernel methods outperform conventional regression methods that are only able to capture linear patterns (Tusell et al. 2013; Long et al. 2010; Morota et al. 2013, 2014).
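A tiny sketch of point (2), assuming the kernel between the new and training lines has already been computed with the same kernel function (object names are illustrative):

# K_s: t x n kernel of the t new lines against the n training lines
# beta_hat: the n nonparametric coefficients estimated on the training set
u_new = K_s %*% beta_hat       # predicted genomic values of the t new lines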

The solution of the mixed model equations given in (8.6) and (8.7) can be obtained using the rrBLUP package (Endelman 2011). This package is not restricted to kernel regression, since it can also estimate marker effects by Ridge regression and compute BLUPs based on an additive relationship matrix. In this package, variance components are estimated by either maximum likelihood (ML) or restricted maximum likelihood (REML; the default) using the spectral decomposition algorithm of Kang et al. (2008). The R function returns the variance components, the maximized log-likelihood (LL), the ML estimate of θ, and the BLUP solution for u.

The basic function for implementing the kernel methods using the rrBLUP package is given next:

mixed.solve(y=y, Z=Z, K=K, X=X, method="REML"),

where y, Z, K, and X are the vector of response variables, the design matrix of random effects, the kernel matrix, and the design matrix of fixed effects, respectively. The estimation method by default is REML, but the ML method is also allowed. The kernel matrix is calculated before using the mixed.solve() function. It is important to point out that the function kinship.BLUP allows implementing three kernel methods directly [linear kernel (RR), Gaussian kernel (GAUSS), and Exponential kernel (EXP)] under a mixed model approach.
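As a hedged sketch (not the Appendix 3 code), one way to build a Gaussian kernel by hand and fit it with mixed.solve() could look as follows, assuming a standardized marker matrix W and a phenotype vector y with one record per line:

library(rrBLUP)
D2 = as.matrix(dist(W))^2                 # squared Euclidean distances between lines
K  = exp(-D2/ncol(W))                     # Gaussian kernel with gamma = 1/p
Z  = diag(nrow(K))                        # one record per line in this sketch
fit   = mixed.solve(y = y, Z = Z, K = K, method = "REML")
u_hat = fit$u                             # BLUPs of the genomic (kernel) effects
y_hat = as.numeric(fit$beta) + u_hat      # fitted values: intercept + genomic effect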

kinship.BLUP(y=y[trn], G.train=W[trn,], G.pred=W[tst,], X=X[trn,], K.method="GAUSS", mixed.method="REML"),

where y[trn] contains the training part of the response variable, W[trn,] contains the markers of the training set, W[tst,] contains the markers of the testing set, X[trn,] contains the fixed effects of the training set, K.method="GAUSS" specifies that the Gaussian kernel will be implemented, and mixed.method="REML" specifies which of the two estimation methods (REML or ML) is used. As mentioned above, kinship.BLUP() allows implementing three kernels: linear, Gaussian, and Exponential. With the wheat599 data set used in the previous examples, a ten fold cross-validation was implemented using these three default kernels.
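For completeness, a short hedged sketch of how the testing-set predictions could be extracted from this call, assuming the components g.pred and beta returned by kinship.BLUP (the accuracy metrics are illustrative):

ans   = kinship.BLUP(y = y[trn], G.train = W[trn, ], G.pred = W[tst, ],
                     K.method = "GAUSS", mixed.method = "REML")
y_hat = ans$g.pred + as.numeric(ans$beta)   # predicted genetic value plus intercept
MSE   = mean((y[tst] - y_hat)^2)
PC    = cor(y[tst], y_hat)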

Table 8.4 shows that, under a mixed model approach using the default kernels [linear (RR), Gaussian (GAUSS), and Exponential (EXP)] available in the rrBLUP library, the best predictions for Env4 of the wheat599 data set were observed with the Gaussian and Exponential kernels under both metrics. The R code for reproducing the results in Table 8.4 is given in Appendix 3.

Table 8.4 Prediction performance in terms of mean square error (MSE) and Pearson’s correlation (PC) for each fold for the wheat599 (Environment 4) data set under a mixed model and three kernels: linear, Gaussian, and Exponential. These kernels are the defaults programmed in the rrBLUP library

Table 8.5 provides the results of nine kernels for the response variable of Env4. Here the kernel matrices were built manually, and the data set used was again wheat599. Results for each of the nine kernels are given for each fold and across the 10 folds. Table 8.5 shows that the best prediction performance was observed with the Gaussian kernel and the worst with the polynomial kernel. The R code for reproducing the results in Table 8.5 is given in Appendix 4.

Table 8.5 Prediction performance in terms of mean square error (MSE) for each fold for the wheat599 (Environment 4) data set under a mixed model, manually building the kernels linear, polynomial, sigmoid, Gaussian, exponential, AK1, AK2, AK3, and AK4

8.7 Hyperparameter Tuning for Building the Kernels

Hand-tuning kernel functions can be time-consuming and requires expert knowledge. A tuned kernel can improve the trained model when standard kernels are insufficient for achieving a good transformation. In this section, we illustrate how to tune kernels and we compare a standard kernel with a hand-tuned kernel. As pointed out in Chap. 4, one approach to tuning is to divide the data into a training set, a tuning set, and a testing set. The training set is for training the model, the tuning set is for choosing the best hyperparameter combination, and the testing set is for evaluating the prediction performance with the best hyperparameters. However, when the data sets are small, after selecting the best combination of hyperparameters, the training and tuning sets are joined into one data set, the model is refitted with the best hyperparameter combination, and finally the prediction performance is evaluated with the testing set.

This conventional approach to tuning is illustrated next using the wheat599 data set. To illustrate how to choose the hyperparameter, we work with the arc-cosine kernel, where the hyperparameter to be tuned is the number of hidden layers, for which we used a grid of 10 values (1, 2, …, 9, 10). To tune the number of hidden layers, we first divided the original data set into four mutually exclusive parts with four fold cross-validation; this is called the outer cross-validation. In each case, three of these parts were used for training and the remaining one for testing. A ten fold cross-validation was then performed within each outer training set; this is called the inner cross-validation, where nine of the ten parts formed the inner training set and the remaining one the tuning set. For each outer fold, each of the 10 grid values for the number of hidden layers was evaluated with the inner ten fold cross-validation, and the mean square error of prediction was computed on each of the 10 tuning sets. The average mean square error over the 10 inner cross-validations (in each outer fold) was then computed for each value in the grid, and the grid value with the smallest average MSE was selected as the optimal number of hidden layers. The inner training and tuning sets were then joined for refitting the model with the optimal number of hidden layers, and finally the prediction performance, also in terms of MSE, was evaluated on the outer testing set. The average of the four values of the outer testing sets is reported as the final prediction performance. Under this strategy, the optimal number of hidden layers can differ between folds.
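The following sketch shows the general shape of the inner grid search for one outer training set; it is only a schematic version of Appendix 5. The arc-cosine recursion used here is the usual order-one construction and may differ in details from the chapter's own AK functions, the fold assignment is random, and the Ridge-type fit relies on rrBLUP's mixed.solve().

library(rrBLUP)
# Arc-cosine kernel (order 1) with nl hidden layers (assumed recursive construction)
AK_kernel = function(X, nl = 1) {
  K = tcrossprod(X)/ncol(X)                    # start from the linear kernel
  for (l in 1:nl) {
    d     = sqrt(diag(K))
    costh = pmin(pmax(K/outer(d, d), -1), 1)
    theta = acos(costh)
    K     = (1/pi)*outer(d, d)*(sin(theta) + (pi - theta)*cos(theta))
  }
  K
}
# Inner ten fold CV on one outer training set to choose the number of layers
tune_layers = function(y_trn, X_trn, grid = 1:10, nfold = 10) {
  n     = length(y_trn)
  folds = sample(rep(1:nfold, length.out = n))
  mse   = sapply(grid, function(nl) {
    K = AK_kernel(X_trn, nl)
    mean(sapply(1:nfold, function(f) {
      tun   = which(folds == f)                # inner tuning set
      Z     = diag(n)[-tun, , drop = FALSE]    # incidence of the inner training records
      fit   = mixed.solve(y = y_trn[-tun], Z = Z, K = K)
      y_hat = as.numeric(fit$beta) + fit$u[tun]
      mean((y_trn[tun] - y_hat)^2)
    }))
  })
  grid[which.min(mse)]                         # layers with the smallest average MSE
}

In the full procedure, this tuning is repeated inside each of the four outer folds, the model is refitted on the joined inner training and tuning data with the selected number of layers, and the MSE is computed on the corresponding outer testing set.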

Figure 8.2 shows that the optimal number of hidden layers is different in each fold. In folds 1 and 3 (Fig. 8.2a, c), the optimal number of hidden layers was equal to 3, while in folds 2 and 4 (Fig. 8.2b, d), the optimal number of hidden layers was equal to 8. Finally, with these optimal values, the model was refitted with the information of the inner training + tuning set, and then for each fold the mean square error (MSE) was calculated for the corresponding outer testing set; the MSEs were 0.6878 (fold 1), 0.6963 (fold 2), 0.9725 (fold 3), and 0.7212 (fold 4), with an MSE across folds equal to 0.7694. The R code for reproducing these results is given in Appendix 5.

Fig. 8.2 Optimal number of hidden layers in each fold using a grid with values 1 to 10 and an inner ten fold cross-validation

8.8 Bayesian Kernel Methods

For a single environment, the model can be expressed as

$$ \boldsymbol{y}=\mu \mathbf{1}+\boldsymbol{u}+\boldsymbol{e}, $$
(8.8)

where μ is the overall mean, 1 is the vector of ones, and y is the vector of observations of size n. Moreover, u is the vector of genomic effects \( \boldsymbol{u}\sim \boldsymbol{N}\left(\mathbf{0},{\sigma}_u^2\boldsymbol{K}\right) \), where \( {\sigma}_u^2 \) is the genomic variance estimated from the data, and matrix K is the kernel constructed with any of the kernel methods explained above (linear, polynomial, sigmoid, Gaussian, exponential, AK1, AK2, ...). The random residuals are assumed independent with normal distribution \( \boldsymbol{e}\sim \boldsymbol{N}\left(\mathbf{0},{\sigma}_e^2\boldsymbol{I}\right) \), where \( {\sigma}_e^2 \) is the error variance.

Now the kernel Ridge regression is cast under a Bayesian framework with \( \lambda ={\sigma}_e^2/{\sigma}_u^2 \), where \( {\sigma}_e^2 \) and \( {\sigma}_u^2 \) are the residual variance and the variance attached to u, respectively. With a flat prior for the mean parameter (μ), \( {\sigma}_e^2\sim {\chi}_{v,S}^{-2} \), and the induced priors \( \boldsymbol{u}\mid {\sigma}_u^2\sim {N}_n\left(\mathbf{0},{\boldsymbol{K}\sigma}_u^2\right) \) and \( {\sigma}_u^2\sim {\chi}_{v_u,\kern0.5em {S}_u}^{-2} \), the full conditional posterior distribution of u in model (8.8) is given by

$$ {\displaystyle \begin{array}{c}f\left(\boldsymbol{u}|-\right)\propto L\left(\mu, \boldsymbol{u},{\sigma}_e^2;\boldsymbol{y}\right)f\left(\boldsymbol{u}|{\sigma}_u^2\right)\\ {}\propto \frac{1}{{\left(2{\pi \sigma}_e^2\right)}^{\frac{n}{2}}}\exp \left[-\frac{1}{2{\sigma}_e^2}{\left\Vert \boldsymbol{y}-{\mathbf{1}}_n\mu -\boldsymbol{u}\right\Vert}^2\right]\ \frac{1}{{\left({\sigma}_u^2\right)}^{\frac{n}{2}}}\exp \left[-\frac{1}{2{\sigma}_u^2}{\boldsymbol{u}}^{\mathrm{T}}{\boldsymbol{K}}^{-1}\boldsymbol{u}\right]\\ {}\propto \exp \left\{-\frac{1}{2}\left[{\left(\boldsymbol{u}-\tilde{\boldsymbol{u}}\right)}^{\mathrm{T}}{\tilde{\boldsymbol{K}}}^{-1}\left(\boldsymbol{u}-\tilde{\boldsymbol{u}}\right)\right]\right\},\end{array}} $$

where \( \overset{\sim }{\boldsymbol{K}}={\left({\sigma}_u^{-2}{\boldsymbol{K}}^{-1}+{\sigma}_e^{-2}{\boldsymbol{I}}_n\right)}^{-1} \) and \( \overset{\sim }{\boldsymbol{u}}={\sigma}_e^{-2}\overset{\sim }{\boldsymbol{K}}\left(\boldsymbol{y}-{\mathbf{1}}_n\mu \right) \), and from here \( \boldsymbol{u}\mid -\sim {N}_n\left(\overset{\sim }{\boldsymbol{u}},\overset{\sim }{\boldsymbol{K}}\right) \). The mean/mode of u ∣ − is \( \overset{\sim }{\boldsymbol{u}}={\sigma}_e^{-2}\overset{\sim }{\boldsymbol{K}}\left(\boldsymbol{y}-{\mathbf{1}}_n\mu \right) \), which is also the BLUP of u under the mixed model equations of Henderson (1975). For this reason, model (8.8) is often referred to as GBLUP. However, here the genomic relationship matrix (GRM) or pedigree matrix P is replaced by any kernel K; for this reason, under a Bayesian framework, we call this model a Bayesian kernel BLUP, which reduces to the Genomic (G) BLUP or pedigree (P) BLUP when the GRM or the pedigree matrix, respectively, is used as the kernel.

The full conditional posteriors of the rest of the parameters are equal to those of the GBLUP model described in Chap. 6: \( \mu \mid -\sim N\left(\overset{\sim }{\mu },{\overset{\sim }{\sigma}}_0^2\right) \), where \( {\overset{\sim }{\sigma}}_0^2=\frac{\sigma_e^2}{n} \) and \( \overset{\sim }{\mu }=\frac{1}{n}{\mathbf{1}}_n^{\mathrm{T}}\left(\boldsymbol{y}-\boldsymbol{u}\right) \); \( {\sigma}_e^2\mid -\sim {\chi}_{\overset{\sim }{v},\overset{\sim }{S}}^{-2} \), where \( \overset{\sim }{v}=v+n \) and \( \overset{\sim }{S}=S+{\left\Vert \boldsymbol{y}-{\mathbf{1}}_n\mu -\boldsymbol{u}\right\Vert}^2 \); and \( {\sigma}_u^2\mid -\sim {\chi}_{{\overset{\sim }{v}}_u,{\overset{\sim }{S}}_u}^{-2} \), where \( {\overset{\sim }{v}}_u={v}_u+n \) and \( {\overset{\sim }{S}}_u={S}_u+{\boldsymbol{u}}^{\mathrm{T}}{\boldsymbol{K}}^{-1}\boldsymbol{u} \). The Bayesian kernel BLUP, like the GBLUP, does not face the large p, small n problem, since thanks to the kernel trick a problem of dimension p is converted into an n-dimensional problem.

The Bayesian kernel BLUP model (8.8) can be implemented easily with the BGLR R package; when the hyperparameters S-v and Su-vu are not specified, v = vu = 5 is used by default and the scale parameters are set as in the BRR. However, a two-step process is required for its implementation. Step 1: Select and compute the kernel matrix to be used. Step 2: Use this kernel matrix to implement the model with the BGLR package.

The BGLR code to fit this model is

ETA = list(list(model = 'RKHS', K = K, df0 = vu, S0 = Su, R2 = 1 - R2))
A = BGLR(y = y, ETA = ETA, nIter = 1e4, burnIn = 1e3, S0 = S, df0 = v, R2 = R2)
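A compact usage sketch of these two steps is given below (the Gaussian kernel, the marker matrix W, and the testing positions Pos_tst are assumptions; when S0 and df0 are omitted, BGLR uses its default rules):

library(BGLR)
# Step 1: compute the kernel (here a Gaussian kernel from standardized markers W)
D2 = as.matrix(dist(W))^2
K  = exp(-D2/ncol(W))
# Step 2: fit the Bayesian kernel BLUP, masking the testing set with NA
y_NA = y
y_NA[Pos_tst] = NA
ETA = list(list(model = 'RKHS', K = K))
A = BGLR(y = y_NA, ETA = ETA, nIter = 1e4, burnIn = 1e3, verbose = FALSE)
y_pred = A$yHat[Pos_tst]                 # predictions for the masked observations
MSE = mean((y[Pos_tst] - y_pred)^2)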

When individuals have more than one replication, or a sophisticated experimental design was used for data collection, the Bayesian kernel BLUP model is specified in a more general way to take this structure into account, as follows:

$$ \boldsymbol{Y}={\mathbf{1}}_n\mu +\boldsymbol{Zu}+\boldsymbol{e} $$
(8.9)

with Z the incidence matrix of the genomic effects. This model cannot be fitted directly in BGLR, and some precalculation is needed to compute the "covariance" matrix of the predictor Zu in model (8.9), Var(Zu) = ZKZT (K_start in the code below). The BGLR code for implementing this model is the following:

Z = model.matrix(~0 + GID, data = dat_F, xlev = list(GID = unique(dat_F$GID)))
K_start = Z %*% K %*% t(Z)
ETA = list(list(model = 'RKHS', K = K_start, df0 = vu, S0 = Su, R2 = 1 - R2))
A = BGLR(y = y, ETA = ETA, nIter = 1e4, burnIn = 1e3, S0 = S, df0 = v, R2 = R2)

To illustrate how to implement the Bayesian kernel BLUP model in BGLR, some examples are provided next.

Example 2

We again consider the prediction of grain yield (tons/ha) based on marker information. The data set used consists of 30 lines in four environments with one or two repetitions, and the genotype information consists of 500 markers for each line. The numbers of lines with one (two) repetitions are 6 (24), 2 (28), 0 (30), and 3 (27) in Environments 1, 2, 3, and 4, respectively, resulting in 229 observations. The prediction performance of all these models was evaluated with 10 random partitions using a cross-validation strategy, where 80% of the complete data set was used to fit the model and the rest to evaluate it in terms of the mean squared error (MSE) of prediction. Nine kernels were evaluated (linear = GBLUP, polynomial, sigmoid, Gaussian, exponential, AK1, AK2, AK3, and AK4). The R code for implementing this model is given in Appendix 6.

The results for all kernels (shown in Table 8.6) were obtained by running the corresponding Gibbs sampler for 10,000 iterations and discarding the first 1000 as burn-in, using the default hyperparameter values implemented in BGLR. We can observe that the worst and second worst prediction performances were obtained with the sigmoid and linear (GBLUP) kernels, while the best and second-best predictions were obtained with the polynomial and Gaussian kernels. However, it is important to point out that the differences between the best and worst predictions were small.

Table 8.6 Mean squared error (MSE) of prediction across 10 random partitions, with 80% for training and the rest for testing, under nine kernel methods

8.8.1 Extended Predictor Under the Bayesian Kernel BLUP

The Bayesian kernel BLUP method can be extended, in terms of the predictor, to easily take into account the effects of other factors. For example, in addition to the genotype effect, the effects of environments and genotype × environment interaction terms can also be incorporated as

$$ \boldsymbol{y}=\mu \mathbf{1}+{\boldsymbol{Z}}_E{\boldsymbol{\beta}}_E+{\boldsymbol{u}}_{\mathbf{1}}+{\boldsymbol{u}}_{\mathbf{2}}+\boldsymbol{\varepsilon}, $$
(8.10)

where y = [y1, …, yI] contains the observations collected in each of the I sites (or environments). The fixed effects of the environments are modeled with the incidence matrix of environments ZE, where the parameters to be estimated are the intercepts for each environment (βE) (other fixed effects can be incorporated into the model). In this model, \( {\boldsymbol{u}}_{\mathbf{1}}\sim \boldsymbol{N}\left(\mathbf{0},{\sigma}_{u_1}^2{\boldsymbol{K}}_{\mathbf{1}}\right) \) represents the genomic main effects, \( {\sigma}_{u_1}^2 \) is the genomic variance component estimated from the data, and \( {\boldsymbol{K}}_{\mathbf{1}}={\boldsymbol{Z}}_{u1}\boldsymbol{K}{\boldsymbol{Z}}_{u1}^{\prime } \), where Zu1 relates the genotypes to the phenotypic observations. The random effect u2 represents the interaction between the genomic effects and environments and is modeled as \( {\boldsymbol{u}}_{\mathbf{2}}\sim \boldsymbol{N}\left(\mathbf{0},{\sigma}_{u_2}^2{\boldsymbol{K}}_{\mathbf{2}}\right) \), where \( {\boldsymbol{K}}_{\mathbf{2}}=\left({\boldsymbol{Z}}_{u1}\boldsymbol{K}{\boldsymbol{Z}}_{u1}^{\prime}\right)\circ \left({\boldsymbol{Z}}_E{\boldsymbol{Z}}_E^{\prime}\right) \), with ∘ denoting the Hadamard product. The BGLR specification for this Bayesian kernel BLUP model with the extended predictor is exactly the same as for the GBLUP method studied in Chap. 6, but instead of using the genomic relationship matrix (linear kernel), any of the kernels mentioned above is specified:

XE = model.matrix(~0 + Env, data = dat_F)[, -1]
K.E = XE %*% t(XE)
Z_L = model.matrix(~0 + GID, data = dat_F, xlev = list(GID = unique(dat_F$GID)))
K_L = Z_L %*% K %*% t(Z_L)    ###### K is the kernel matrix
K_LE = K.E * K_L
ETA_K = list(list(model = 'FIXED', X = XE),
             list(model = 'RKHS', K = K_L),
             list(model = 'RKHS', K = K_LE))
y_NA = y
y_NA[Pos_tst] = NA
A = BGLR(y = y_NA, ETA = ETA_K, nIter = 1e4, burnIn = 1e3, verbose = FALSE,
         S0 = S, df0 = v, R2 = R2)

To illustrate the Bayesian kernel BLUP with the extended predictor described in Eq. (8.10), we used a data set that contains 30 lines in four environments, with genotype information composed of 500 markers for each line. We call this data set dat_ls_E2. Only the following kernels were implemented: linear, polynomial, sigmoid, Gaussian, exponential, AK1, and AK4. The R code for reproducing the results in Table 8.7 is given in Appendix 7. We can observe that, when the genotype × environment interaction is taken into account in the predictor, the best prediction performance was obtained with the Gaussian kernel while the worst was obtained with the sigmoid kernel.

Table 8.7 Mean squared error (MSE) of prediction across 10 random partitions, with 80% for training and the rest for testing, under seven kernel methods with the predictor including the effects of environment + genotypes + genotype × environment interaction term. Here we used the dat_ls_E2 data set

It is important to point out that in the predictor under a Bayesian kernel BLUP using BGLR, as many terms as desired can be included, and the specification is very similar to how it was done with three terms in the predictor in this example. Using BGLR, the Bayesian kernel BLUP can be implemented for binary and ordinal response variables. Next, we provide one example for binary response variables and one for categorical response variables.

8.8.2 Extended Predictor Under the Bayesian Kernel BLUP with a Binary Response Variable

It is important to point out that it is feasible to implement the Bayesian kernel BLUP with binary response variables using the probit link function in BGLR. This implementation first requires calculating the kernel to be used; then with the following lines of code, the Bayesian kernel BLUP can be fitted for binary and categorical response variables:

XE = model.matrix(~0 + Env, data = dat_F)[, -1]
K.E = XE %*% t(XE)
Z_L = model.matrix(~0 + GID, data = dat_F, xlev = list(GID = unique(dat_F$GID)))
K_L = Z_L %*% K %*% t(Z_L)
K_LE = K.E * K_L
ETA_K = list(list(model = 'FIXED', X = XE),
             list(model = 'RKHS', K = K_L),
             list(model = 'RKHS', K = K_LE))
y_NA = y
y_NA[Pos_tst] = NA
A = BGLR(y = y_NA, ETA = ETA_K, response_type = "ordinal",
         nIter = 1e4, burnIn = 1e3, verbose = FALSE)
Probs = A$probs[Pos_tst, ]

When binary or categorical response variables are used, two things need to be modified to fit the model in BGLR. The first is that we need to specify response_type="ordinal", and the second is that the outputs are now the class probabilities, which can be extracted with A$probs. When response_type="ordinal" is omitted, the response variable is assumed Gaussian by default.

To give an example with a binary response variable, we used the EYT Toy data set (Data_Toy_EYT.RData) that is preloaded in the BMTME library. This data set is composed of 40 lines, four environments (Bed5IR, EHT, Flat5IR, and LHT), and four response variables: DTHD, DTMT, GY, and Height. G_Toy_EYT is the genomic relationship matrix of dimension 40 × 40. The first two variables are ordinal with three categories, the third is continuous (GY = grain yield), and the last one (Height) is binary. In this example, we work only with the binary response variable (Height).

Table 8.8 gives the results of implementing the Bayesian kernel BLUP method for a binary response variable with seven kernels using the EYT Toy data set with trait Height. The best predictions were obtained with the AK4 kernel, and the worst with the sigmoid kernel. Again, we can see that, in general, most kernel methods outperform the linear kernel. The R code for reproducing the results in Table 8.8 is given in Appendix 8.

Table 8.8 Proportion of cases correctly classified (PCCC) across 10 random partitions, with 80% for training and the rest for testing, under seven kernel methods with the predictor including the effects of environment + genotypes + genotype × environment interaction term with the Data_Toy_EYT with trait Height

8.8.3 Extended Predictor Under the Bayesian Kernel BLUP with a Categorical Response Variable

The fitting process in BGLR for a categorical response variable is exactly the same as for the binary response variable explained above. For this reason, the results given in Table 8.9 for the categorical response variable were obtained with the same R code given in Appendix 8 with the following two modifications: (a) y=dat_F$DTMT instead of y=dat_F$Height and (b) yp_ts=apply(Probs,1,which.max) instead of yp_ts=apply(Probs,1,which.max)-1, since now the response variable has levels 1, 2, and 3.

Table 8.9 Proportion of cases correctly classified (PCCC) across 10 random partitions, with 80% for training and the rest for testing, under seven kernel methods with the predictor including effects of environment + genotypes + genotype × environment interaction term with the Data_Toy_EYT with the ordinal trait DTMT

Table 8.9 shows that the best prediction performance for the categorical response variable was observed with the polynomial kernel, while the worst was observed with the AK4 kernel.

8.9 Multi-trait Bayesian Kernel

In BGLR, it is possible to fit multi-trait Bayesian kernel BLUP methods, and the fitting process is exactly the same as fitting multi-trait Bayesian GBLUP methods (see Chap. 6). The only difference is that instead of using a linear kernel, any kernel can be used. The basic R code for fitting multi-trait Bayesian kernel methods is given next:

y_NA = data.matrix(y)
y_NA[Pos_tst, ] = NA
A4 = Multitrait(y = y_NA, ETA = ETA_K.Gauss,
                resCov = list(type = "UN", S0 = diag(4), df0 = 5),
                nIter = 10000, burnIn = 1000)
Me_Gauss = PC_MM_f(y[Pos_tst, ], A4$ETAHat[Pos_tst, ], Env = dat_F$Env[Pos_tst])
A4$ETAHat[Pos_tst, ]

To illustrate the fitting process of a multi-trait Bayesian kernel BLUP method, we used the EYT Toy data set (Data_Toy_EYT.RData), but now using the four response variables simultaneously (even though only GY is Gaussian, we assumed that all four response variables satisfy this assumption). The R code for its implementation is given in Appendix 9. Table 8.10 gives the prediction performance in terms of the mean square error across the 10 random partitions, where the best kernel for prediction differs between environment–trait combinations; the polynomial kernel and the linear kernel were the best in 4 out of 16 environment–trait combinations.

Table 8.10 Mean square error (MSE) of prediction across 10 random partitions, with 80% for training and the rest for testing, under seven kernel methods with the predictor including the effects of environment + genotypes + genotype × environment interaction term with the Data_Toy_EYT assuming that the multi-trait response is Gaussian

8.10 Kernel Compression Methods

By kernel compression methods, we mean those tools that allow us to approximate kernels without greatly affecting the prediction accuracy, while gaining a significant reduction in computational resources. There are many methods for compressing kernels; in this section, we only illustrate the method proposed by Cuevas et al. (2020). The basic idea of this method is to approximate the original kernel using a small subset of size m of the n observations (lines in the context of GS) available in the training set, which significantly reduces the computational resources required to build the kernel matrix.

Before giving the details of the kernel compression proposed by Cuevas et al. (2020), it is important to point out that model (8.8) can be reparametrized as Eq. (8.11) when the eigenvalue decomposition of the kernel matrix K is expressed as \( \boldsymbol{K}=\boldsymbol{U}{\boldsymbol{S}}^{1/2}{\boldsymbol{S}}^{1/2}{\boldsymbol{U}}^{\mathrm{T}} \),

$$ \boldsymbol{y}=\mu {1}_n+\boldsymbol{Pf}+\boldsymbol{\varepsilon}, $$
(8.11)

where \( \boldsymbol{f}\sim N\left(\mathbf{0},{\sigma}_f^2{\boldsymbol{I}}_{r,r}\right) \) (with r the rank of K) and P = US1/2. Note that models (8.8) and (8.11) are equivalent, and model (8.11) can be fitted with a conventional Ridge regression model. This Ridge regression reparameterization is often computationally very efficient, since usually r < min(n, p), which is common in multi-environment and/or multi-trait models. Note that only r effects need to be estimated; these are projected through P to explain the n observations without any significant loss of precision given the available information.
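A tiny sketch of this reparameterization (K is assumed to be an existing kernel matrix; small negative eigenvalues that can appear numerically are discarded):

EVD = eigen(K, symmetric = TRUE)
pos = EVD$values > 1e-10                 # keep the r positive eigenvalues (rank of K)
P   = EVD$vectors[, pos] %*% diag(sqrt(EVD$values[pos]), nrow = sum(pos))
# P %*% t(P) reproduces K (up to numerical error), so y = mu*1 + P f + e
# with f ~ N(0, sigma_f^2 I) is equivalent to u ~ N(0, sigma_u^2 K)
max(abs(P %*% t(P) - K))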

Next, we describe the Cuevas et al. (2020) method for compressing the kernel matrix K. First, the method approximates the original kernel matrix (K) using a smaller sub-matrix Km,m (m < n) constructed with m out of n lines. The rank of Km,m is m. Under the assumption that the row vectors are linearly independent, Williams and Seeger (2001) showed that the Nyström approximation of the kernel is as follows:

$$ \boldsymbol{K}\approx \boldsymbol{Q}={\boldsymbol{K}}_{n,m}{\boldsymbol{K}}_{m,m}^{-1}{\boldsymbol{K}}_{n,m}^{\prime }, $$

where Q has the rank of Km,m, that is, m. The computation of this kernel is cheaper because it is not necessary to compute and store the original matrix K; only Km,m and Kn,m are required.

Therefore, Km,m can be computed with m lines and all p markers, that is, with Xm,p. For the linear kernel (GBLUP), \( {\boldsymbol{K}}_{m,m}=\frac{{\boldsymbol{X}}_{m,p}{\boldsymbol{X}}_{m,p}^{\prime }}{p} \) and \( {\boldsymbol{K}}_{n,m}=\frac{{\boldsymbol{X}}_{n,p}{\boldsymbol{X}}_{m,p}^{\prime }}{p} \), which captures the relationship of all n lines with the m lines. Note that in the construction of Q, all the n lines and all the p markers are considered, but not all their relationships are accounted for. For example, relationships \( {\boldsymbol{K}}_{n-m,n-m}=\frac{{\boldsymbol{X}}_{n-m,p}{\boldsymbol{X}}_{n-m,p}^{\prime }}{p} \) are not considered (where n − m denotes the complement of the m lines). To see this, we order the elements of matrix Q in blocks, such that \( {\boldsymbol{Q}}_{n,n}=\left[\begin{array}{cc}{\boldsymbol{Q}}_{m,m}& {\boldsymbol{Q}}_{m,n-m}\\ {}{\boldsymbol{Q}}_{n-m,m}& {\boldsymbol{Q}}_{n-m,n-m}\end{array}\right] \).
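A small numerical sketch of this construction for a linear kernel (W is a standardized marker matrix; m and the sampled lines are arbitrary, and the full K is formed here only to make the comparison explicit):

p     = ncol(W)
K     = tcrossprod(W)/p                       # full linear kernel (for comparison only)
m     = 50
idx_m = sample(1:nrow(W), m)
K_mm  = tcrossprod(W[idx_m, ])/p
K_nm  = W %*% t(W[idx_m, ])/p
Q     = K_nm %*% solve(K_mm) %*% t(K_nm)      # Nystrom approximation of K
# The m x m block of Q for the sampled lines equals K_mm exactly; only the block
# among the non-sampled lines is approximated
max(abs(Q[idx_m, idx_m] - K_mm))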

Rasmussen and Williams (2006) showed that Qm,m = Km,m, Qn−m,m = Kn−m,m, and Qm,n−m = Km,n−m, and that the difference Kn−m,n−m − Qn−m,n−m, that is, \( {\boldsymbol{K}}_{n-m,n-m}-{\boldsymbol{K}}_{n-m,m}{\boldsymbol{K}}_{m,m}^{-\mathbf{1}}{\boldsymbol{K}}_{m,n-m} \), is known as the Schur complement of Km,m in Kn,n. Then, because Km,m and Kn,n are assumed positive semi-definite, this Schur complement is also positive semi-definite: \( {\boldsymbol{Q}}_{n,n}=\left[\begin{array}{cc}{\boldsymbol{K}}_{m,m}& {\boldsymbol{K}}_{m,n-m}\\ {}{\boldsymbol{K}}_{n-m,m}& {\boldsymbol{Q}}_{n-m,n-m}\end{array}\right] \). Assuming that the effects un−m ∣ um are conditionally independent, Snelson and Ghahramani (2006) and Misztal et al. (2014) proposed substituting the diagonal of Qn−m,n−m with the diagonal of Kn−m,n−m.

In the method called projected process, Seeger et al. (2003) showed theoretically that using all lines and minimizing the Kullback–Leibler distance KL(q(u ∣ y) ‖ p(u ∣ y)) justifies substituting the matrix K in the prior distribution of u (of model 8.8) by the Nyström approximation Q (Titsias 2009). That is, the random genetic vector has a normal distribution \( \boldsymbol{u}\sim N\left(\mathbf{0},{\sigma}_u^2\boldsymbol{Q}\right) \), where \( \boldsymbol{Q}={\boldsymbol{K}}_{n,m}{\boldsymbol{K}}_{m,m}^{-\mathbf{1}}{\boldsymbol{K}}_{n,m}^{\prime } \).

With these adjustments in the distribution of the random effects u, model (8.8) is used for prediction. It is common to estimate the parameters \( {\sigma}_e^2 \) and \( {\sigma}_u^2 \) of the model with the marginal likelihood and then predict the random effects using the matrix inversion lemma, which is fast. Furthermore, if matrix Q is used directly in BGLR, there is no advantage in terms of saving computational resources with the approximate kernel. Therefore, an eigenvalue decomposition \( {\boldsymbol{K}}_{m,m}^{-\mathbf{1}}=\boldsymbol{U}{\boldsymbol{S}}^{-\mathbf{1}/\mathbf{2}}{\boldsymbol{S}}^{-\mathbf{1}/\mathbf{2}}{\boldsymbol{U}}^{\prime } \) is used, where U contains the eigenvectors of order m × m and Sm,m is a diagonal matrix of order m × m with the eigenvalues ordered from largest to smallest. Substituting these values in Q gives \( {\boldsymbol{u}}_n\sim N\left(\mathbf{0},{\sigma}_u^2{\boldsymbol{K}}_{n,m}\boldsymbol{U}{\boldsymbol{S}}^{-\mathbf{1}/\mathbf{2}}{\boldsymbol{S}}^{-\mathbf{1}/\mathbf{2}}{\boldsymbol{U}}^{\prime }{\boldsymbol{K}}_{n,m}^{\prime}\right) \), and thus, thanks to the properties of the normal distribution, model (8.8) can be expressed like model (8.11) as

$$ \boldsymbol{y}=\mu {1}_n+\boldsymbol{Pf}+\boldsymbol{\varepsilon} $$
(8.12)

Model (8.12) is similar to model (8.11), except that f is a vector of order m × 1 with a normal distribution of the form \( \boldsymbol{f}\sim \boldsymbol{N}\left(\mathbf{0},{\sigma}_f^2{\boldsymbol{I}}_{m,m}\right) \), and \( \boldsymbol{P}={\boldsymbol{K}}_{n,m}\boldsymbol{U}{\boldsymbol{S}}^{-\mathbf{1}/\mathbf{2}} \) is now the design matrix. This implies estimating only m effects that are projected into the n-dimensional space in order to predict un and explain yn. Note that model (8.12) has a Ridge regression solution, and thus software for Bayesian Ridge regression like the BGLR R package, or software for conventional Ridge regression like glmnet, can be used for fitting model (8.12).

In summary, according to Cuevas et al. (2020), the approximation described above consists of the following steps:

  • Step 1. Computing matrix Km,m from m lines of the training set.

  • Step 2. Computing matrix Kn,m.

  • Step 3. Eigenvalue decomposition of Km,m.

  • Step 4. Computing matrix P = Kn,mUS−1/2.

  • Step 5. Fitting the model under a Ridge regression framework (like BGLR or glmnet) and making genomic-enabled predictions for future data.

With the following R code, the design matrix P = Kn,mUS−1/2 can be computed under a linear kernel.

################## Linear approximate kernel ########################
Sparse_linear_kernel = function(m, X) {
  XF = X
  p = ncol(XF)
  pos_m = sample(1:nrow(XF), m)
  #### Step 1: compute K_m ####
  X_m = XF[pos_m, ]
  K_m = X_m %*% t(X_m)/p
  #### Step 2: compute K_n_m ####
  K_n_m = XF %*% t(X_m)/p
  #### Step 3: eigenvalue decomposition of K_m ####
  EVD_K_m = eigen(K_m)
  U = EVD_K_m$vectors            # Eigenvectors
  S = EVD_K_m$values             # Eigenvalues
  S_0.5_Inv = sqrt(1/S)          # Square root of the inverse of the eigenvalues
  S_mat_Inv = diag(S_0.5_Inv)    # Diagonal matrix of these values
  #### Step 4: compute matrix P ####
  P = K_n_m %*% U %*% S_mat_Inv
  return(P)
}

To use this function to create the design matrix P = Kn,mUS−1/2, you need to provide the standardized matrix of markers X and the number of lines m to be used for computing the approximate linear kernel. Then, with this P, you can implement the Ridge regression model under a Bayesian or conventional framework.
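For instance, a hedged usage sketch with BGLR's Bayesian Ridge regression on the projected design matrix (XF is the standardized marker matrix; m, the testing positions Pos_tst, and the MCMC settings are placeholders):

library(BGLR)
P = Sparse_linear_kernel(m = 264, X = XF)     # approximate linear kernel with m = 264 lines
y_NA = y
y_NA[Pos_tst] = NA
ETA = list(list(model = 'BRR', X = P))        # Ridge regression on the m projected effects
A = BGLR(y = y_NA, ETA = ETA, nIter = 1e4, burnIn = 1e3, verbose = FALSE)
y_pred = A$yHat[Pos_tst]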

Table 8.11 indicates that the lower the value of m for building the approximate kernel, the smaller the time required for its implementation in the four response variables (Env1, …, Env4). The table also shows that the lower the value of m, the worse the prediction performance in terms of MSE and PC. However, it is really interesting that with a reduction in the training set from 599 (all data) to 264 (only 44% of the total data set), the implementation time is reduced to almost half without any significant loss in terms of prediction performance. The complete R code to reproduce the results provided in Table 8.11 is given in Appendix 10.

Table 8.11 Prediction performance in terms of mean square error (MSE) and average Pearson’s correlation (PC) under the approximate linear kernel with the method proposed by Cuevas et al. (2020)

The approximate kernel method can be used for any of the kernels studied before. The construction of the approximate Gaussian kernel matrix can be done with the following R function:

#################### Gaussian kernel function ######################
l2norm = function(x) { sqrt(sum(x^2)) }
K.radial = function(x1, x2 = x1, gamma = 1) {
  exp(-gamma * outer(1:nrow(x1 <- as.matrix(x1)), 1:ncol(x2 <- t(x2)),
                     Vectorize(function(i, j) l2norm(x1[i, ] - x2[, j])^2)))
}
K.rad = K.radial(x1 = XF, x2 = XF, gamma = 1/ncol(XF))
#################### Approximate Gaussian kernel ###################
Sparse_Gaussian_kernel = function(m, X) {
  XF = X
  p = ncol(XF)
  pos_m = sample(1:nrow(XF), m)
  #### Step 1: compute K_m ####
  X_m = XF[pos_m, ]
  K_m = K.radial(x1 = X_m, x2 = X_m, gamma = 1/p)
  #### Step 2: compute K_n_m ####
  K_n_m = K.radial(x1 = XF, x2 = X_m, gamma = 1/p)
  #### Step 3: eigenvalue decomposition of K_m ####
  EVD_K_m = eigen(K_m)
  U = EVD_K_m$vectors            # Eigenvectors
  S = EVD_K_m$values             # Eigenvalues
  S_0.5_Inv = sqrt(1/S)          # Square root of the inverse of the eigenvalues
  S_mat_Inv = diag(S_0.5_Inv)    # Diagonal matrix of these values
  #### Step 4: compute matrix P ####
  P = K_n_m %*% U %*% S_mat_Inv
  return(P)
}

With this approximate kernel method, an equivalent to Table 8.11 was reproduced, but instead of using a linear kernel, a Gaussian kernel was used. The results for the same values of m as those used in Table 8.11 (m = 15, 32, 74, 132, 264, and 599), with the wheat599 data set for each of the four response variables are given in Table 8.12.

Table 8.12 Prediction performance in terms of mean square error (MSE) and average Pearson’s correlation (PC) under the approximate Gaussian kernel with the method proposed by Cuevas et al. (2020) using the wheat599 data set

Again, we can see in Table 8.12 that the smaller the value of m used for approximating the kernel, the lower the prediction performance, but the shorter the implementation time. However, it is very interesting to point out that, with this data set, the approximation obtained when 44% (264 lines) of the original 599 lines were used is quite good for the four response variables (E1, …, E4). The approximation to the full data set was slightly better under the approximate linear kernel (Table 8.11) than under the approximate Gaussian kernel (Table 8.12), but in general, the predictions obtained with the Gaussian kernel (full and approximate) were better than those obtained with the linear kernel (full and approximate). The R code given in Appendix 10 can be used for reproducing the results given in Table 8.12 by replacing the function of the approximate linear kernel with the function for the approximate Gaussian kernel.

8.10.1 Extended Predictor Under the Approximate Kernel Method

Now we will illustrate the approximate kernel using the expanded predictor (8.10) that, in addition to the main effect of lines, contains the main effects of environments and the interaction term between environments and lines. The approximate method is similar to the single-environment case, that is, \( {\boldsymbol{u}}_{\mathbf{1}}\sim N\left(\mathbf{0},{\sigma}_{u_1}^2{\boldsymbol{Q}}^{u_1}\right) \), where \( {\boldsymbol{K}}_{\mathbf{1}}\approx {\boldsymbol{Q}}^{u_1}={\boldsymbol{Z}}_{u1}\left({\boldsymbol{K}}_{n,m}{\boldsymbol{K}}_{m,m}^{-\mathbf{1}}{\boldsymbol{K}}_{n,m}^{\mathrm{T}}\right){\boldsymbol{Z}}_{u1}^{\mathrm{T}} \), whereas for the random interaction \( {\boldsymbol{u}}_{\mathbf{2}}\sim N\left(\mathbf{0},{\sigma}_{u_2}^2{\boldsymbol{Q}}^{u_2}\right) \), where \( {\boldsymbol{K}}_{\mathbf{2}}\approx {\boldsymbol{Q}}^{u_2}=\left[{\boldsymbol{Z}}_{u1}\left({\boldsymbol{K}}_{n,m}{\boldsymbol{K}}_{m,m}^{-\mathbf{1}}{\boldsymbol{K}}_{n,m}^{\mathrm{T}}\right){\boldsymbol{Z}}_{u1}^{\mathrm{T}}\right]\circ \left[{\boldsymbol{Z}}_E{\boldsymbol{Z}}_E^{\mathrm{T}}\right] \).

Also, we decomposed \( {\boldsymbol{K}}_{m,m}^{-\mathbf{1}} \) in such a way that model (8.10) could be approximated as

$$ \boldsymbol{y}=\mu \mathbf{1}+{\boldsymbol{Z}}_E{\boldsymbol{\beta}}_E+{\boldsymbol{P}}^{u_1}\boldsymbol{f}+{\boldsymbol{P}}^{u_2}\boldsymbol{l}+\boldsymbol{\varepsilon}, $$
(8.13)

where \( {\boldsymbol{P}}^{u_1}={\boldsymbol{Z}}_{u1}\boldsymbol{P}={\boldsymbol{Z}}_{u1}{\boldsymbol{K}}_{n,m}\boldsymbol{U}{\boldsymbol{S}}^{-\mathbf{1}/\mathbf{2}} \) of order n × m, with n = n1 + n2 + … + nI, with f a vector of \( m\times 1;{\boldsymbol{P}}^{u_2}={\boldsymbol{P}}^{u_1}:{\boldsymbol{Z}}_E \) of order n × mI and the vector l is of order mI × 1, and the notation \( {\boldsymbol{P}}^{u_1}:{\boldsymbol{Z}}_E \) denotes the interaction term between the design matrix \( {\boldsymbol{P}}^{u_1} \) and ZE.
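A hedged sketch of how these design matrices might be assembled in R is given below (dat_F, the marker matrix X of the distinct lines, and m are assumed to exist, and the rows of X are assumed to be in the same order as the GID levels; the column-wise product shown here is one way of forming \( {\boldsymbol{P}}^{u_1}:{\boldsymbol{Z}}_E \)):

# Incidence matrices of lines and environments
Z_u1 = model.matrix(~0 + GID, data = dat_F, xlev = list(GID = unique(dat_F$GID)))
Z_E  = model.matrix(~0 + Env, data = dat_F)
# P of order L x m from the approximate kernel (e.g., the linear version given above)
P    = Sparse_linear_kernel(m = m, X = X)
P_u1 = Z_u1 %*% P                              # main-effect design matrix, n x m
# Interaction design matrix P_u1 : Z_E of order n x (m*I): each block multiplies
# P_u1 column-wise by one environment indicator
P_u2 = do.call(cbind, lapply(1:ncol(Z_E), function(j) P_u1 * Z_E[, j]))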

In summary, the suggested approximate method described above can be summarized as

  • Step 1. We assume that we have a matrix of markers X that contains the lines without replication, that is, each row corresponds to a different line. We assume that this matrix contains L lines (rows) and p markers (columns). Also, it is important to point out that this matrix is standardized by columns.

  • Step 2. We randomly select m lines out of L from the training set X.

  • Step 3. Next we construct matrices Km,m and KL,m, from the matrix of markers as \( {\boldsymbol{K}}_{m,m}=\frac{{\boldsymbol{X}}_{m,p}{\boldsymbol{X}}_{m,p}^{\prime }}{p} \), \( {\boldsymbol{K}}_{L,m}=\frac{{\boldsymbol{X}}_{L,p}{\boldsymbol{X}}_{m,p}^{\prime }}{p} \)

  • Step 4. Eigenvalue decomposition of Km,m.

  • Step 5. Computing matrix P = KL,mUS−1/2.

  • Step 6. Computing matrix \( {\boldsymbol{P}}^{u_1}={\boldsymbol{Z}}_{u1}\ \boldsymbol{P}={\boldsymbol{Z}}_{u1}{\boldsymbol{K}}_{L,m}\boldsymbol{U}{\boldsymbol{S}}^{-\mathbf{1}/\mathbf{2}}. \)

  • Step 7. Computing matrix \( {\boldsymbol{P}}^{u_2}={\boldsymbol{P}}^{u_1}:{\boldsymbol{Z}}_E \), where : denotes the interaction between the design matrix \( {\boldsymbol{P}}^{u_1} \) and ZE.

  • Step 8. Fitting the model under a Ridge regression framework (like BGLR or glmnet) and making genomic-enabled predictions for future data.

It is important to point out that the extension of the approximate kernel method for an extended predictor requires that some lines were studied in some environments but not in all environments. This extended approximate kernel method is expected to be efficient when the number of environments is low and the number of lines is large.

To illustrate the extended approximate kernel method, we used the Data_Toy_EYT data set, which contains four environments, four traits, and 40 observations in each environment. Here we only used the continuous trait (GY) as the response variable. The approximate kernels were built using only the 40 distinct lines, with m = 4, 8, 12, 16, 20, and 40 lines used to build each approximation. Instead of using only one kernel, we implemented five (linear, polynomial, sigmoid, Gaussian, and exponential). The R code for reproducing the results in Table 8.13 is given in Appendix 11.

Table 8.13 Prediction performance in terms of mean square error (MSE) and average Pearson's correlation (PC) under five approximate kernels (linear, polynomial, sigmoid, Gaussian, and exponential) with the method proposed by Cuevas et al. (2020)

We can see in Table 8.13 that there are differences in the prediction performance of the different kernels. However, the approximate kernel, even with m = 4, often outperformed the exact kernel (m = 40), which implies that when the lines are highly correlated, even a small sample size m is enough to approximate the kernel and obtain good prediction performance. Moreover, using a sample size m smaller than the total number of lines n significantly reduces the implementation time, and this gain is more relevant when the number of lines is very large, as stated by Cuevas et al. (2020).

8.11 Final Comments

In the context of genomic prediction, arguably, genotypes and phenotypes may be linked in functional forms that are not well addressed by the linear additive models that are standard in quantitative genetics. Therefore, developing statistical machine learning models for predicting phenotypic values from all available molecular information and that are capable of capturing complex genetic network architectures is of great importance. Kernel Ridge regression methods are nonparametric prediction models proposed for this purpose. Their essence is to create a spatial distance-based relationship matrix called a kernel (Morota et al. 2013). For this reason, many kernel functions have been developed to capture complex nonlinear patterns that many times outperform conventional linear regression models in terms of prediction performance.

The kernel trick allows building nonlinear versions of any linear algorithm that can be expressed in terms of inner products, by replacing those inner products with kernel evaluations. We close the chapter with the following remarks about kernel methods:

  1. The kernels are interpreted as scalar products in high-dimensional spaces.

  2. There are kernels with great versatility, and composite kernels can be built; some can be computed in closed form, while others require an iterative process.

  3. The number of dimensions increases the complexity, and with it the risk of overfitting.

  4. Kernel methods are an alternative to least squares algorithms, controlling complexity through regularization.

  5. Kernel methods guarantee existence and uniqueness of the solution, just like least squares methods.

  6. Nonlinear versions can be built using the kernel trick, obtaining statistical machines with great expressive capacity, but with training control.

  7. Kernel statistical machine learning methods provide promising tools for large-scale and high-dimensional genomic data processing.

  8. Kernel methods can also be viewed from a regression perspective and can be integrated with classical methods for gene prioritization, prediction, and data fusion.

  9. Kernel methods further improve the scalability of conventional machine learning methods and their versatility to work with heterogeneous inputs.

  10. Kernel methods are remarkably flexible and elegant, as they are the predictive principle underlying most linear mixed models commonly used in plant breeding, as well as others used in spatial analysis and classification problems.

  11. Kernel methods exploit complexity to improve prediction accuracy, but they do not help very much to increase our understanding of that complexity.

  12. Kernels based on data compression ideas are very promising for dealing with very large data sets, but software is needed, as well as more research to develop new methods and improve the existing ones.