
7.1 Introduction

Categorical and count response variables are common in many plant breeding programs. For this reason, this chapter covers the fundamental and practical issues involved in implementing genomic prediction models for these types of response variables, using current open-source software. The prediction models presented here are Bayesian and classical models. For each type of model, we provide the fundamental principles and concepts, and illustrate their practical implementation with examples in the genomic selection context. We take into account the most common terms in the predictor, such as the main effects of environments and genotypes, and interaction terms like the genotype × environment and marker × environment interactions. The models under the Bayesian framework are presented first, followed by those under a classical framework. Next, the fundamentals and applications of the Bayesian ordinal regression model are provided.

7.2 Bayesian Ordinal Regression Model

In many applications, the response (Y) is not continuous, but rather multi-categorical in nature: ordinal (the categories of the response can be ordered, 1, …, C) or nominal (the categories have no order). For example, in plant breeding, categorical scores for resistance and susceptibility are often collected [disease: 1 (no disease), 2 (low infection), 3 (moderate infection), 4 (high infection), and 5 (complete infection)]. Another example is in animal breeding, where calving ease is scored as 1, 2, or 3 to indicate a normal birth, a slightly difficult birth, or an extremely difficult birth, respectively. Each categorical trait is assigned to one of a set of mutually exclusive and exhaustive response values.

It is well documented that linear regression models in this situation are difficult to justify (Gianola 1980, 1982; Gianola and Foulley 1983) and inadequate results can be obtained, unless there is a large number of categories and the data seem to show a distribution that is approximately normal (Montesinos-López et al. 2015b).

For the ordinal model, we appeal to the existence of a latent continuous random variable and the categories are conceived as contiguous intervals on the continuous scale, as presented in McCullagh (1980), where the points of division (thresholds) are denoted as γ0, γ1, …, γC. In this way, the ordinal model assumes that conditioned to xi (covariates of dimension p), Yi is a random variable that takes values 1, …, C, with the following probabilities:

$$ {p}_{ic}=P\left({Y}_i=c\right)=P\left({\gamma}_{c-1}\le {L}_i\le {\gamma}_c\right) $$
(7.1)
$$ =F\left({\gamma}_c+{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \right)-F\left({\gamma}_{c-1}+{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \right),\kern0.5em c=1,\dots, C, $$

where \( {L}_i=-{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} +{\epsilon}_i \) is a latent variable, ϵi is a random error term with cumulative distribution function F, β = (β1, …, βp)T denotes the parameter vector associated with the effects of the covariates, and − ∞  = γ0 < γ1 < … < γC = ∞ are threshold parameters that also need to be estimated; that is, the parameter vector to be estimated in this model is \( \boldsymbol{\theta} ={\left({\boldsymbol{\beta}}^{\mathrm{T}},{\boldsymbol{\gamma}}^{\mathrm{T}},{\sigma}_{\beta}^2\right)}^{\mathrm{T}} \), where γ = (γ1, …, γC − 1)T.

All the ordinal models presented in this chapter share the property that the categories can be considered as contiguous intervals on some continuous scale, but they differ in their assumptions regarding the distribution of the latent variable. When F is the standard logistic distribution function, the resulting model is known as the logistic ordinal model:

$$ {p}_{ic}=F\left({\gamma}_c+{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \right)-F\left({\gamma}_{c-1}+{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \right),\kern0.5em c=1,\dots, C, $$

where \( F(z)=\frac{1}{1+\exp \left(-z\right)}. \) When F is the standard normal distribution function, the resulting model is the ordinal probit model:

$$ {p}_{ic}=F\left({\gamma}_c+{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \right)-F\left({\gamma}_{c-1}+{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \right),\kern0.5em c=1,\dots, C, $$

where \( F(z)={\int}_{-\infty}^z\frac{1}{\sqrt{2\pi }}\exp \left(-\frac{u^2}{2}\right) du \). When the response only takes two values, model (7.1) reduces to the binary regression model, which in the first case is better known as logistic regression and in the second case as the probit regression model. Hence, logistic regression or probit regression is a particular case of ordinal regression even under a Bayesian framework, and the Bayesian frameworks described for ordinal regression in this chapter reduce to logistic or probit regression when only one threshold parameter (γ1) needs to be estimated, or equivalently when this threshold is set to 0 and an intercept parameter (β0) is added to the vector of coefficients to be estimated, β = (β0, β1, …, βp)T. In what follows, the development concentrates on the probit ordinal regression model.
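To make the link between the thresholds, the linear predictor, and the category probabilities concrete, the following minimal R sketch evaluates the probabilities in (7.1) under the probit and logistic links; the threshold values and the value of \( {\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \) used below are hypothetical and serve only for illustration:

# Hypothetical thresholds gamma_0 < gamma_1 < ... < gamma_C for C = 4 categories
gamma <- c(-Inf, -1.0, 0.2, 1.1, Inf)
eta_i <- 0.35                      # hypothetical value of x_i' beta for one individual
# p_ic = F(gamma_c + x_i' beta) - F(gamma_{c-1} + x_i' beta), c = 1, ..., C
p_probit <- pnorm(gamma[-1] + eta_i) - pnorm(gamma[-length(gamma)] + eta_i)
p_logit  <- plogis(gamma[-1] + eta_i) - plogis(gamma[-length(gamma)] + eta_i)
round(rbind(probit = p_probit, logistic = p_logit), 3)   # each row sums to 1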

A Bayesian specification for the ordinal regression is very similar to the linear model described in Chap. 6, and assumes the following prior distribution for the parameter vector \( \boldsymbol{\theta} ={\left({\boldsymbol{\beta}}^T,{\boldsymbol{\gamma}}^T,{\sigma}_{\beta}^2\right)}^{\mathrm{T}}: \)

$$ f\left(\boldsymbol{\theta} \right)\propto \frac{1}{{\left({\sigma}_{\beta}^2\right)}^{\frac{p}{2}}}\exp \left(\left[-\frac{1}{2{\sigma}_{\beta}^2}{\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{\beta} \right]\right)\frac{{\left(\frac{S_{\beta }}{2}\right)}^{\frac{v_{\beta }}{2}}}{\Gamma \left(\frac{v_{\beta }}{2}\right){\left({\sigma}_{\beta}^2\right)}^{1+\frac{v_{\beta }}{2}}}\exp \left(-\frac{S_{\beta }}{2{\sigma}_{\beta}^2}\right) $$

That is, a flat prior is assumed for the threshold parameter γ, a multivariate normal distribution for the vector of beta coefficients, \( \boldsymbol{\beta} \mid {\sigma}_{\beta}^2\sim {N}_p\left(\mathbf{0},{\sigma}_{\beta}^2{\boldsymbol{I}}_p\right) \), and a scaled inverse chi-squared for \( {\sigma}_{\beta}^2. \) From now on, this model will be referred to as BRR in this chapter in analogy to the Bayesian linear regression and the way that this is implemented in the BGLR R package. Similar to the genomic linear regression model, the posterior distribution of the parameter vector does not have a known form and a Gibbs sampler is used to explore this; for this reason, in the coming lines, the Gibbs sampling method proposed by Albert and Chib (1993) is described. To make it possible to derive the full conditional distributions, the parameter vector is augmented with a latent variable in the representation of model (7.1).

The joint probability density function of the vector of observations, Y = (Y1, …, Yn)T, and the latent variables L = (L1, …, Ln)T, evaluated in the vector values y = (y1, …, yn)T and l = (l1, …, ln)T, is given by

$$ {\displaystyle \begin{array}{c}f\left(\boldsymbol{y},\boldsymbol{l}|\boldsymbol{\beta}, \boldsymbol{\gamma} \right)=\prod \limits_{i=1}^n{f}_{L_i}\left({l}_i\right){I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right)\\ {}=\prod \limits_{i=1}^n\exp \left[-\frac{1}{2}{\left({l}_i+{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \right)}^2\right]{I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right)\\ {}=\exp \left[-\frac{1}{2}\sum \limits_{i=1}^n{\left({l}_i+{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \right)}^2\right]\prod \limits_{i=1}^n{I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right),\end{array}} $$

where \( {I}_A \) is the indicator function of a set A, and \( {f}_{L_i}\left({l}_i\right) \) is the density function of the latent variable that in the ordinal probit regression model corresponds to the normal distribution with mean \( -{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \) and unit variance. Then the full conditional of the jth component of β, βj, given the rest of the parameters and the latent variables, is given by

$$ {\displaystyle \begin{array}{c}f\left({\beta}_j|-\right)\propto f\left(\boldsymbol{y},\boldsymbol{l}|\boldsymbol{\beta}, \boldsymbol{\gamma} \right)f\left(\boldsymbol{\beta} |{\sigma}_{\beta}^2\right)\\ {}\propto \prod \limits_{i=1}^n{f}_{L_i}\left({l}_i\right){I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right)f\left(\boldsymbol{\beta} |{\sigma}_{\beta}^2\right)\\ {}\propto \exp \left[-\frac{1}{2}\sum \limits_{i=1}^n{\left({e}_{ij}+{x}_{ij}{\beta}_j\right)}^2-\frac{1}{2{\sigma}_{\beta}^2}{\beta}_j^2\right]\\ {}\propto \exp \left\{-\frac{1}{2{\tilde{\sigma}}_{\beta_j}^2}{\left({\beta}_j-{\tilde{\beta}}_j\right)}^2\right\},\end{array}} $$

where \( {e}_{ij}={l}_i+{\sum}_{\genfrac{}{}{0pt}{}{k=1}{k\ne j}}^p{x}_{ik}{\beta}_k \), l = (l1, …, ln)T, xj = [x1j, …, xnj]T, ej = [e1j, …, enj]T, \( {\overset{\sim }{\sigma}}_{\beta_j}^2={\left({\sigma}_{\beta}^{-2}+{\boldsymbol{x}}_j^{\mathrm{T}}{\boldsymbol{x}}_j\right)}^{-1} \), and \( {\overset{\sim }{\beta}}_j=-{\overset{\sim }{\sigma}}_{\beta_j}^2\left({\boldsymbol{x}}_j^{\mathrm{T}}{\boldsymbol{e}}_j\right) \). So, the full conditional distribution of βj is \( N\left({\overset{\sim }{\beta}}_j,{\overset{\sim }{\sigma}}_{\beta_j}^2\right) \).

Now, the full conditional distribution of the threshold parameter γc is

$$ {\displaystyle \begin{array}{c}f\left({\gamma}_c|-\right)\propto f\left(\boldsymbol{y},\boldsymbol{l}|\boldsymbol{\beta}, \boldsymbol{\gamma} \right)\\ {}\propto \prod \limits_{i=1}^n{I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right)\\ {}\propto \prod \limits_{i\in \left\{i:{y}_i=c\right\}}{I}_{\left({\gamma}_{c-1},{\gamma}_c\right)}\left({l}_i\right)\prod \limits_{i\in \left\{i:{y}_i=c+1\right\}}{I}_{\left({\gamma}_c,{\gamma}_{c+1}\right)}\left({l}_i\right)\\ {}\propto {I}_{\left({a}_c,{b}_c\right)}\left({\gamma}_c\right),\end{array}} $$

where ac =  max {li : yi = c} and bc =  min {li : yi = c + 1}. So γc ∣  −  ∼ U(ac, bc), c = 1, 2, …, C − 1. Now, the full conditional distribution of the latent variables is

$$ f\left(\boldsymbol{l}|-\right)\propto f\left(\boldsymbol{y},\boldsymbol{l}|\boldsymbol{\beta}, \boldsymbol{\gamma} \right)\propto \prod \limits_{i=1}^n{f}_{L_i}\left({l}_i\right){I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right) $$

from which we have that conditional on the rest of parameters, the latent variables are independent truncated normal random variables, that is, \( {L}_i\mid -\sim N\left(-{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta}, 1\right) \) truncated in the interval \( \left({\gamma}_{y_i-1},{\gamma}_{y_i}\right) \), i = 1, …, n.

The full conditional distribution of \( {\sigma}_{\beta}^2 \) is the same as in the linear regression model described in Chap. 6, so a Gibbs sampler for exploration of the posterior distribution of the parameters of the models can be implemented by iterating the following steps:

  1. Initialize the parameters: βj = βj0, j = 1, …, p, γc = γc0, c = 1, …, C − 1, li = li0, i = 1, …, n, and \( {\sigma}_{\beta}^2={\sigma}_{\beta 0}^2 \).

  2. For each i = 1, …, n, simulate li from the normal distribution \( N\left(-{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta}, 1\right) \) truncated in \( \left({\gamma}_{y_i-1},{\gamma}_{y_i}\right) \).

  3. For each j = 1, …, p, simulate from the full conditional distribution of βj, that is, from a \( N\left({\overset{\sim }{\beta}}_j,{\overset{\sim }{\sigma}}_{\beta_j}^2\right), \) where \( {\overset{\sim }{\sigma}}_{\beta_j}^2={\left({\sigma}_{\beta}^{-2}+{\boldsymbol{x}}_j^{\mathrm{T}}{\boldsymbol{x}}_j\right)}^{-1} \) and \( {\overset{\sim }{\beta}}_j=-{\overset{\sim }{\sigma}}_{\beta_j}^2\left({\boldsymbol{x}}_j^{\mathrm{T}}{\boldsymbol{e}}_j\right) \).

  4. For each c = 1, 2, …, C − 1, simulate from the full conditional distribution of γc, γc ∣  −  ∼ U(ac, bc), where ac =  max {li : yi = c} and bc =  min {li : yi = c + 1}.

  5. Simulate \( {\sigma}_{\beta}^2 \) from a scaled inverse chi-squared distribution with parameters \( {\overset{\sim }{v}}_{\beta }={v}_{\beta }+p \) and \( {\overset{\sim }{S}}_{\beta }={S}_{\beta }+{\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{\beta} \), that is, from a \( {\chi}^{-2}\left({\overset{\sim }{v}}_{\beta },{\overset{\sim }{S}}_{\beta}\right). \)

  6. Repeat steps 2–5 until the burn-in period and the desired number of samples are reached (a minimal R sketch of these steps is given below).
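The following is a minimal didactic R sketch of steps 1–6 for the ordinal probit model with the BRR prior. The function and variable names are ours (this is not part of BGLR), and the sketch assumes that every category 1, …, C is observed in y:

# Didactic Gibbs sampler for the ordinal probit BRR model (7.1); X: n x p marker matrix,
# y: ordinal response coded 1, ..., C; vb, Sb: hyperparameters of the prior of sigma_beta^2
gibbs_ordinal_probit <- function(y, X, nIter = 2000, vb = 5, Sb = 1) {
  n <- nrow(X); p <- ncol(X); C <- max(y)
  # Step 1: initial values
  beta   <- rep(0, p)
  gamma  <- c(-Inf, qnorm((1:(C - 1)) / C), Inf)  # gamma_0, gamma_1, ..., gamma_C
  sigma2 <- 1
  xtx    <- colSums(X^2)
  Beta   <- matrix(NA, nIter, p); Gamma <- matrix(NA, nIter, C - 1)
  for (t in 1:nIter) {
    eta <- drop(X %*% beta)
    # Step 2: l_i ~ N(-x_i' beta, 1) truncated to (gamma_{y_i - 1}, gamma_{y_i})
    u <- runif(n, pnorm(gamma[y] + eta), pnorm(gamma[y + 1] + eta))
    l <- -eta + qnorm(u)
    # Step 3: beta_j ~ N(beta_j_tilde, sigma2_j_tilde)
    for (j in 1:p) {
      e_j  <- l + eta - X[, j] * beta[j]          # e_ij = l_i + sum_{k != j} x_ik beta_k
      s2j  <- 1 / (1 / sigma2 + xtx[j])
      bj   <- -s2j * sum(X[, j] * e_j)
      bnew <- rnorm(1, bj, sqrt(s2j))
      eta  <- eta + X[, j] * (bnew - beta[j])
      beta[j] <- bnew
    }
    # Step 4: gamma_c ~ U(max{l_i : y_i = c}, min{l_i : y_i = c + 1})
    for (k in 1:(C - 1)) gamma[k + 1] <- runif(1, max(l[y == k]), min(l[y == k + 1]))
    # Step 5: sigma_beta^2 ~ scaled inverse chi-squared(vb + p, Sb + beta' beta)
    sigma2 <- (Sb + sum(beta^2)) / rchisq(1, df = vb + p)
    Beta[t, ] <- beta; Gamma[t, ] <- gamma[2:C]
  }
  list(beta = Beta, gamma = Gamma)
}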

A similar Gibbs sampler is implemented in the BGLR R package, and the hyperparameters are established in a similar way as in the linear regression models described in Chap. 6. When the hyperparameters Sβ and vβ are not specified, by default BGLR assigns vβ = 5 and gives Sβ a value such that the prior mode of \( {\sigma}_{\beta}^2 \) (χ−2(vβ, Sβ)) is equal to half of the “total variance” of the latent variables (Pérez and de los Campos 2014). This model can also be implemented with a fixed value of the variance component \( {\sigma}_{0\beta}^2 \) by choosing a very large value of the degrees of freedom parameter (vβ = 10^10) and taking the scale parameter \( {S}_{\beta }={\sigma}_{0\beta}^2\left({v}_{\beta }+2\right) \).
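For instance, assuming the df0 and S0 hyperparameter arguments of the BRR prior in BGLR, and a hypothetical value sigma2_0 at which one wishes to (approximately) fix the variance component, this specification could look as follows:

sigma2_0 <- 0.5                                  # hypothetical value at which sigma_beta^2 is fixed
ETA = list(list(X = X, model = 'BRR',
                df0 = 1e10,                      # very large degrees of freedom
                S0 = sigma2_0*(1e10 + 2)))       # scale such that the prior mode equals sigma2_0
A = BGLR(y = y, response_type = 'ordinal', ETA = ETA, nIter = 1e4, burnIn = 1e3)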

The basic R code with the BGLR implementation of this model is as follows:

ETA = list(list(X=X, model='BRR'))
A = BGLR(y=y, response_type='ordinal', ETA=ETA, nIter = 1e4, burnIn = 1e3)
Probs = A$probs

where the predictor component is specified in ETA: argument X specifies the design matrix, whereas the argument model describes the priors of the model’s parameters. The response variable vector is given in y, and the ordinal model is specified in the response_type argument. In the last line, the posterior means of the probabilities of each category are extracted, and can be used to give an estimation of a particular metric when comparing models or to evaluate the performance of this model, which will be explained later. An application of this ordinal model is given by Montesinos-López et al. (2015a).

Similar to the linear regression model in Chap. 6, when the number of markers used in genomic prediction is very large relative to the number of observations, p ≫ n, the dimension of the posterior distribution to be explored (number of parameters in θ, p + C − 1 + 1) can be reduced by simulating the vector of genetic effects b = (b1, …, bn)T, where \( {b}_i={\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \), i = 1, …, n, instead of β. Because \( \boldsymbol{\beta} \mid {\sigma}_{\beta}^2\sim {N}_p\left(\mathbf{0},{\sigma}_{\beta}^2{\boldsymbol{I}}_p\right) \), the induced distribution of b = (b1, …, bn)T is \( \boldsymbol{b}\mid {\sigma}_g^2\sim {N}_n\left(\mathbf{0},{\sigma}_g^2\boldsymbol{G}\right) \), where \( {\sigma}_g^2=p{\sigma}_{\beta}^2 \) and \( \boldsymbol{G}=\frac{1}{p}\boldsymbol{X}{\boldsymbol{X}}^{\mathrm{T}} \). Then, with this and assuming a scaled inverse chi-squared distribution as prior for \( {\sigma}_g^2 \), \( {\sigma}_g^2\sim {\chi}^{-2}\left({v}_g,{S}_g\right) \), and a flat prior for the threshold parameters (γ), an ordinal probit GBLUP Bayesian regression model specification of model (7.1) is given by

$$ {p}_{ic}=P\left({Y}_i=c\right)=\Phi \left({\gamma}_c+{b}_i\right)-\Phi \left({\gamma}_{c-1}+{b}_i\right),\kern0.5em c=1,\dots, C, $$
(7.2)

where now Li =  − bi + ϵi is the latent variable, and Φ is the standard normal cumulative distribution function. In matrix form, the model for the latent variable can be specified as L =  − b + ϵ, where L = (L1, …, Ln)T and ϵ ∼ Nn(0, In). A Gibbs sampler of the posterior of the parameters of this model can be obtained similarly to what was done for model (7.1). Indeed, the full conditional density of b is given by

$$ {\displaystyle \begin{array}{c}f\left(\boldsymbol{b}|-\right)\propto f\left(\boldsymbol{y},\boldsymbol{l}|\boldsymbol{b},\boldsymbol{\gamma} \right)f\left(\boldsymbol{b}|{\sigma}_g^2\right)\\ {}\propto \prod \limits_{i=1}^n{f}_{L_i}\left({l}_i\right){I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right)f\left(\boldsymbol{b}|{\sigma}_g^2\right)\\ {}\propto \exp \left[-\frac{1}{2}{\left(\boldsymbol{l}+\boldsymbol{b}\right)}^{\mathrm{T}}\left(\boldsymbol{l}+\boldsymbol{b}\right)-\frac{1}{2{\sigma}_g^2}{\boldsymbol{b}}^{\mathrm{T}}{\boldsymbol{G}}^{-1}\boldsymbol{b}\right]\\ {}\propto \exp \left[-\frac{1}{2}{\left(\boldsymbol{b}-\tilde{\boldsymbol{b}}\right)}^{\mathrm{T}}{\tilde{\boldsymbol{\Sigma}}}_b^{-1}\left(\boldsymbol{b}-\tilde{\boldsymbol{b}}\right)\right],\end{array}} $$

where \( {\overset{\sim }{\boldsymbol{\Sigma}}}_b={\left({\sigma}_g^{-2}{\boldsymbol{G}}^{-1}+{\boldsymbol{I}}_n\right)}^{-1} \) and \( \overset{\sim }{\boldsymbol{b}}=-{\overset{\sim }{\boldsymbol{\Sigma}}}_b\boldsymbol{l} \). So the full conditional distribution of b is \( {N}_n\left(\overset{\sim }{\boldsymbol{b}},{\overset{\sim }{\boldsymbol{\Sigma}}}_b\right) \). Then, a Gibbs sampler exploration of this model can be done by iterating the following steps:

  1. Initialize the parameters: bi = bi0, i = 1, …, n, γc = γc0, c = 1, …, C − 1, li = li0, i = 1, …, n, and \( {\sigma}_g^2={\sigma}_{g0}^2 \).

  2. For each i = 1, …, n, simulate li from the normal distribution \( N\left(-{b}_i,1\right) \) truncated in \( \left({\gamma}_{y_i-1},{\gamma}_{y_i}\right) \).

  3. Simulate b from \( {N}_n\left(\overset{\sim }{\boldsymbol{b}},{\overset{\sim }{\boldsymbol{\Sigma}}}_b\right) \), with \( {\overset{\sim }{\boldsymbol{\Sigma}}}_b={\left({\sigma}_g^{-2}{\boldsymbol{G}}^{-1}+{\boldsymbol{I}}_n\right)}^{-1} \) and \( \overset{\sim }{\boldsymbol{b}}=-{\overset{\sim }{\boldsymbol{\Sigma}}}_b\boldsymbol{l} \).

  4. For each c = 1, 2, …, C − 1, simulate from the full conditional distribution of γc, γc ∣  −  ∼ U(ac, bc), where ac =  max {li : yi = c} and bc =  min {li : yi = c + 1}.

  5. Simulate \( {\sigma}_g^2 \) from a scaled inverse chi-squared distribution with parameters \( \tilde{v}_{g}={v}_g+n \) and \( {\overset{\sim }{S}}_g={S}_g+{\boldsymbol{b}}^{\mathrm{T}}{\boldsymbol{G}}^{-1}\boldsymbol{b} \), that is, from a \( {\chi}^{-2}\left(\tilde{v}_{g},{\overset{\sim }{S}}_g\right). \)

  6. Repeat steps 2–5 until the burn-in period and the desired number of samples are reached.

This model can also be implemented with the BGLR package:

ETA = list(list(K=G, model='RKHS'))
A = BGLR(y=y, response_type='ordinal', ETA=ETA, nIter = 1e4, burnIn = 1e3)
Probs = A$probs

where instead of specifying the design matrix and the prior model for the associated regression coefficients, the genomic relationship matrix G is now given, and the corresponding model is RKHS, which can also be used when another type of relationship information between the lines is available, for example, the pedigree relationship matrix.
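For completeness, a minimal sketch of how the genomic relationship matrix used in K can be computed from the marker matrix (following \( \boldsymbol{G}=\frac{1}{p}\boldsymbol{X}{\boldsymbol{X}}^{\mathrm{T}} \) with centered and standardized markers, and assuming no monomorphic markers) is:

# Genomic relationship matrix from the marker matrix X (n lines x p markers)
Xs = scale(X, center = TRUE, scale = TRUE)   # center and standardize each marker
G  = tcrossprod(Xs)/ncol(Xs)                 # G = Xs %*% t(Xs) / p
ETA = list(list(K = G, model = 'RKHS'))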

Like the Bayesian linear regression model, other variants of model BRR can be obtained by adopting different prior models for the beta coefficients (β): FIXED, BayesA, BayesB, BayesC, or BL, for example. See Chap. 6 for details of each of these prior models for the regression coefficients. Indeed, the full conditional distributions used to implement a Gibbs sampler in each of these ordinal models are the same as the corresponding Bayesian linear regression models, except that the full conditional distribution for the variance component of random errors is not needed and two full conditional distributions (for the threshold parameters and the latent variables) are added, where now the latent variable will play the role of the response value in the Bayesian linear regression. These models can also be implemented in the BGLR package.

For example, a Gibbs sampler implementation for ordinal model (7.1) with a BayesA prior (see Chap. 6) can be done by the following steps:

  1. Initialize the parameters: βj = βj0, j = 1, …, p, γc = γc0, c = 1, …, C − 1, li = li0, i = 1, …, n, \( {\sigma}_{\beta_j}^2={\sigma}_{\beta_j0}^2 \), j = 1, …, p, and Sβ = Sβ0.

  2. For each i = 1, …, n, simulate li from the normal distribution \( N\left(-{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta}, 1\right) \) truncated in \( \left({\gamma}_{y_i-1},{\gamma}_{y_i}\right) \).

  3. For each j = 1, …, p, simulate from the full conditional distribution of βj, that is, from a \( N\left({\overset{\sim }{\beta}}_j,{\overset{\sim }{\sigma}}_{\beta_j}^2\right), \) where \( {\overset{\sim }{\sigma}}_{\beta_j}^2={\left({\sigma}_{\beta_j}^{-2}+{\boldsymbol{x}}_j^{\mathrm{T}}{\boldsymbol{x}}_j\right)}^{-1} \) and \( {\overset{\sim }{\beta}}_j=-{\overset{\sim }{\sigma}}_{\beta_j}^2\left({\boldsymbol{x}}_j^{\mathrm{T}}{\boldsymbol{e}}_j\right) \).

  4. For each c = 1, 2, …, C − 1, simulate from the full conditional distribution of γc, γc ∣  −  ∼ U(ac, bc), where ac =  max {li : yi = c} and bc =  min {li : yi = c + 1}.

  5. Simulate from the full conditional of each \( {\sigma}_{\beta_j}^2 \):

    $$ {\sigma}_{\beta_j}^2\mid \boldsymbol{\gamma}, \boldsymbol{l},\boldsymbol{\beta},{\boldsymbol{\sigma}}_{-j}^2,{S}_{\beta}\sim {\chi}^{-2}\left({\overset{\sim }{v}}_{\beta_j},{\overset{\sim }{S}}_{\beta_j}\right), $$

    where \( {\overset{\sim }{v}}_{\beta_j}={v}_{\beta }+1 \), the scale parameter is \( {\overset{\sim }{S}}_{\beta_j}={S}_{\beta }+{\beta}_j^2 \), and \( {\boldsymbol{\sigma}}_{-j}^2 \) is the vector \( {\boldsymbol{\sigma}}_{\beta}^2=\left({\sigma}_{\beta_1}^2,\dots, {\sigma}_{\beta_p}^2\right) \) without the jth entry.

  6. Simulate from the full conditional of Sβ:

    $$ {\displaystyle \begin{array}{c}f\left({S}_{\beta }|-\right)\propto \left[\prod \limits_{j=1}^pf\left({\sigma}_{\beta_j}^2|{S}_{\beta }\ \right)\right]f\left({S}_{\beta}\right)\\ {}\propto \prod \limits_{j=1}^p\left[\frac{{\left(\frac{S_{\beta }}{2}\right)}^{\frac{v_{\beta }}{2}}}{\Gamma \left(\frac{v_{\beta }}{2}\right){\left({\sigma}_{\beta_j}^2\right)}^{1+\frac{v_{\beta }}{2}}}\exp \left(-\frac{S_{\beta }}{2{\sigma}_{\beta_j}^2}\right)\right]{S}_{\beta}^{s-1}\exp \left(-{rS}_{\beta}\right)\\ {}\propto {S}_{\beta}^{s+\frac{pv_{\beta }}{2}-1}\exp \left[-\left(r+\frac{1}{2}\sum \limits_{j=1}^p\frac{1}{\sigma_{\beta_j}^2}\right){S}_{\beta}\right]\end{array}} $$

    which corresponds to the kernel of a gamma distribution with shape parameter \( \overset{\sim }{s}=s+\frac{pv_{\beta }}{2} \) and rate parameter \( \overset{\sim }{r}=r+\frac{1}{2}{\sum}_{j=1}^p\frac{1}{\sigma_{\beta_j}^2} \), and so \( {S}_{\beta}\mid -\sim \mathrm{Gamma}\left(\overset{\sim }{s},\overset{\sim }{r}\right). \)

  7. Repeat steps 2–6 depending on how many samples of the parameter vector (\( {\boldsymbol{\beta}}^{\mathrm{T}},{\boldsymbol{\sigma}}_{\beta}^2,{\boldsymbol{\gamma}}^{\mathrm{T}} \)) you wish to simulate (a short R sketch of the draws in steps 5 and 6 is given below).
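A short R sketch of the BayesA-specific draws in steps 5 and 6 follows; all values below are hypothetical and only illustrate these two sampling steps:

# Hypothetical current state of the chain, for illustration only
beta   <- rnorm(10)                 # current marker effects (p = 10)
S_beta <- 1                         # current value of the common scale S_beta
v_beta <- 5; s <- 1.1; r <- 0.1     # hyperparameters (illustrative values)
p <- length(beta)
# Step 5: sigma2_beta_j | - ~ scaled inverse chi-squared(v_beta + 1, S_beta + beta_j^2)
sigma2_beta <- (S_beta + beta^2)/rchisq(p, df = v_beta + 1)
# Step 6: S_beta | - ~ Gamma(shape = s + p*v_beta/2, rate = r + 0.5*sum(1/sigma2_beta_j))
S_beta <- rgamma(1, shape = s + p*v_beta/2, rate = r + 0.5*sum(1/sigma2_beta))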

The implementation of BayesA for an ordinal response variable in BGLR is as follows:

ETA = list(list(X=X, model='BayesA'))
A = BGLR(y=y, response_type='ordinal', ETA=ETA, nIter = 1e4, burnIn = 1e3)
Probs = A$probs

The other Bayesian ordinal regression models, BayesB, BayesC, and BL, are implemented in BGLR by simply replacing the corresponding model in ETA, as commented before; see Chap. 6 for details and the differences between all these prior models for the regression coefficients.

7.2.1 Illustrative Examples

Example 1 BRR and GBLUP

To illustrate how these models work and how they can be implemented, here we used a toy data set with an ordinal response with five levels, consisting of 50 lines planted in three environments (a total of 150 observations) and genotypic information on 1000 markers for each line. To evaluate the performance of the described models (BRR, BayesA, GBLUP, BayesB, BayesC, and BL), 10 random partitions were used in a cross-validation strategy, where 80% of the complete data set was used to train the model and the rest to evaluate it in terms of the Brier score (BS) and the proportion of cases correctly classified (PCCC).
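Both metrics can be computed from the matrix of estimated category probabilities returned by BGLR for the testing observations and the observed categories. A minimal sketch follows; the multi-category Brier score below is computed as the mean squared difference between the predicted probabilities and the indicator of the observed category, which may differ by a constant scaling from the exact definition used to produce the tables:

# Probs_tst: matrix of predicted probabilities (rows = testing observations, columns = categories)
# y_tst: observed categories (coded 1, ..., C) of the testing observations
brier_score <- function(Probs_tst, y_tst) {
  C <- ncol(Probs_tst)
  Y01 <- outer(y_tst, 1:C, "==")*1            # indicator matrix of the observed categories
  mean(rowSums((Probs_tst - Y01)^2))
}
pccc <- function(Probs_tst, y_tst) {
  y_hat <- apply(Probs_tst, 1, which.max)     # predicted category = most probable category
  mean(y_hat == y_tst)
}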

The six models were implemented with the BGLR R package, using a burn-in period of 1000, 10,000 iterations, and the default hyperparameters for the prior distributions. The results are shown in Table 7.1, where we can observe a similar performance of all models in terms of the Brier score, and a similar but poor performance of all models in terms of the proportion of cases correctly classified (PCCC). With the BRR model, only in 4 out of 10 partitions were the PCCC values greater than or equal to what can be obtained by random assignment (1/5), while with the GBLUP model this happened in 3 out of 10 partitions. A similar behavior was obtained with the rest of the models. On average, the PCCC values were 0.1566, 0.1466, 0.15, 0.1466, 0.1533, and 0.1433 for the BRR, GBLUP, BayesA, BayesB, BayesC, and Bayes Lasso (BL) models, respectively.

Table 7.1 Brier score (BS) and proportion of cases correctly classified (PCCC) across 10 random partitions, with 80% of the total data set used for training and the rest for testing, under model (7.1) with different prior models for the marker coefficients: ordinal Bayesian Ridge regression model (BRR), BayesA, BayesB, BayesC, and BL; and under the ordinal GBLUP regression model (7.2)

The R code to reproduce the results in Table 7.1 is given in Appendix 1.

In some applications, additional information is available, such as the sites (locations) where the experiments were conducted, environmental covariates, etc., which can be taken into account to improve the prediction performance.

One extension of model (7.1) that takes into account environment effects and environment–marker interaction is given by

$$ {p}_{ic}=P\left({Y}_i=c\right)=F\left({\gamma}_c-{\eta}_i\right)-F\left({\gamma}_{c-1}-{\eta}_i\right),c=1,\dots, C, $$
(7.3)

where now the predictor is \( {\eta}_i={\boldsymbol{x}}_{Ei}^{\mathrm{T}}{\boldsymbol{\beta}}_E+{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} +{\boldsymbol{x}}_{EM i}^T{\boldsymbol{\beta}}_{EM} \), which, in addition to the marker effects (\( {\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \)) in the second summand, includes the environment effects and the environment–marker interaction effects in the first and third summands, respectively. In the latent variable representation, this model is equivalently expressed as

$$ P\left({Y}_i=c\right)=P\left({\gamma}_{c-1}\le {L}_i\le {\gamma}_c\right),c=1,\dots, C, $$

where L = (L1, …, Ln)T = XEβE + Xβ + XEMβEM + ϵ is the vector with latent random variables of all observations, ϵ ∼ Nn(0, In) is a random error vector, XE, X, and XEM are the design matrices of the environments, markers, and environment–marker interactions, respectively, while βE and βEM are the vectors of the environment effects and the interaction effects, respectively, with a prior distribution that can be specified as was done for β. In fact, with the BGLR function, it is also possible to implement all these extensions, since it allows using any of the several priors included here: FIXED, BRR, BayesA, BayesB, BayesC, and BL. For example, the basic BGLR code to implement model (7.3) with a flat prior (“FIXED”) for the environment effects, a BRR prior for marker effects and for the environment–marker interaction effects, is as follows:

X = scale(X)
#Environment matrix design
XE = model.matrix(~Env, data=dat_F)[,-1]
#Environment-marker interaction matrix design
XEM = model.matrix(~0+X:Env, data=dat_F)
#Predictor
ETA = list(list(X=XE, model='FIXED'),
           list(X=X, model='BRR'),
           list(X=XEM, model='BRR'))
A = BGLR(y=y, response_type='ordinal', ETA=ETA, nIter = 1e4, burnIn = 1e3, verbose = FALSE)
Probs = A$probs

where dat_F is the data file that contains all the information about how the data were collected (GID: lines or individuals; Env: environment; y: response variable of the trait). Other prior models for the beta coefficients of each predictor component are obtained by simply replacing the “model” argument of the corresponding component of the predictor. For example, for a BayesA prior for the marker effects, in the second sub-list we must use model='BayesA'.

The latent random vector of model (7.1) under the GBLUP specification, plus genotypic and environment×genotype interaction effects, takes the form

$$ \boldsymbol{L}={\boldsymbol{X}}_E{\boldsymbol{\beta}}_E+{\boldsymbol{Z}}_L\boldsymbol{g}+{\boldsymbol{Z}}_{LE}\boldsymbol{gE}+\boldsymbol{\epsilon} $$
(7.4)

which is like model (6.7) in Chap. 6, where ZL and ZLE are the incidence matrices of the genotype and environment×genotype interaction effects, respectively, and g and gE are the corresponding random effects with distributions \( {N}_J\left(\mathbf{0},{\sigma}_g^2\boldsymbol{G}\right) \) and \( {N}_{IJ}\left(\mathbf{0},{\sigma}_{gE}^2\left({\boldsymbol{I}}_I\boldsymbol{\bigotimes}\boldsymbol{G}\right)\right) \), respectively, where J is the number of different lines in the data set and I is the number of environments. This model can be trained with the BGLR package as follows:

I = length(unique(dat_F$Env))
XE = model.matrix(~0+Env, data=dat_F)[,-1]
Z_L = model.matrix(~0+GID, data=dat_F, xlev = list(GID=unique(dat_F$GID)))
K_L = Z_L%*%G%*%t(Z_L)
Z_LE = model.matrix(~0+GID:Env, data=dat_F,
                    xlev = list(GID=unique(dat_F$GID), Env=unique(dat_F$Env)))
K_LE = Z_LE%*%kronecker(diag(I), G)%*%t(Z_LE)
ETA = list(list(model='FIXED', X=XE),
           list(model='RKHS', K=K_L),
           list(model='RKHS', K=K_LE))
A = BGLR(y, response_type='ordinal', ETA=ETA, nIter = 1e4, burnIn = 1e3)
Probs = A$probs

where dat_F is as before, that is, the data file that contains all the information of how the data were collected. Similar specification can be done when other kinds of covariates are available, for example, environmental covariates.

Example 2 Environments + Markers + Marker×Environment Interaction

This example illustrates how models (7.3) and (7.4) can be fitted and used for prediction. Here we used the same data set as in Example 1, but now the environment and environment×marker effects are included in the predictor. For model (7.3), a flat prior is assigned to the environment effects [FIXED, a normal distribution with mean 0 and a very large variance (10^10)], and one of the BRR, BayesA, BayesB, BayesC, or BL prior models is assigned to the marker and marker×environment interaction effects. For model (7.4), a flat prior is also assigned to the environment effects. The performance evaluation was done using the same 10 random partitions used in Example 1, where 80% of the complete data set was used for training the model and the rest for testing, and the Brier score (BS) and the proportion of cases correctly classified (PCCC) were computed.

The results are shown in Table 7.2. They indicate an improved prediction performance of these models compared to the models fitted in Example 1, which only take into account the marker effects (see Table 7.1). However, this improvement is only slight under the Brier score, because the reduction in the average BS across the 10 partitions was 1.55, 3.74, 0.96, 0.83, 1.41, and 1.32% for BRR, GBLUP, BayesA, BayesB, BayesC, and BL, respectively. A more notable improvement was obtained with the PCCC criterion, where now for all models, in about 8 out of the 10 partitions, the value of this metric was greater than the one obtained by random chance. Indeed, the average values across the 10 partitions were 0.27, 0.28, 0.28, 0.27, 0.28, and 0.27 for BRR, GBLUP, BayesA, BayesB, BayesC, and BL, respectively. This represents 74.47, 90.90, 86.67, 86.36, 80.43, and 86.05% improvement for these models with respect to their counterpart models where only marker effects or genomic effects were considered. So, the greatest improvement was observed with the GBLUP model under both metrics, but the performance of all models is almost indistinguishable, with an advantage in execution time for the GBLUP model with respect to the rest. These issues were pointed out in Chaps. 5 and 6. This difference is even greater when the number of markers is larger than the number of observations.

Table 7.2 Brier score (BS) and proportion of cases correctly classified (PCCC) across 10 random partitions, with 80% of the total data set used for training and the rest for testing, under model (7.3) with different prior models for the marker effects and the environment×marker interaction: BRR, BayesA, BayesB, BayesC, and BL; and under the ordinal GBLUP regression model (7.4)

Example 3 Binary Traits

For this example, we used the EYT Toy data set consisting of 40 lines, four environments (Bed5IR, EHT, Flat5IR, and LHT), and a binary response variable based on plant height (0 = low, 1 = high). For this example, marker information is not available, only the genomic relationship matrix for the 40 lines. So, only models (7.3) (M3) and (7.4) (M4) are fitted, and their prediction performance was evaluated using cross-validation. Also, model (7.5) (M5) is added to this comparison, which excludes the line×environment interaction effects, that is, only the environment effects and the genetic effects are taken into account in the linear predictor:

$$ \boldsymbol{L}={\boldsymbol{X}}_E{\boldsymbol{\beta}}_E+{\boldsymbol{Z}}_L\boldsymbol{g}+\boldsymbol{\epsilon} $$
(7.5)

The results are presented in Table 7.3, with the BS and PCCC metrics obtained in each partition of the random CV strategy. From this we can observe that the best performance with both metrics was obtained with the model that considered only the genetic effects (M3; (7.3)). On average, models M4 and M5 gave a BS that was 24.02 and 19.79% greater than that of model M3, while with model M3, on average across the 10 partitions, an improvement of 21.57 and 19.19% in terms of PCCC was obtained with respect to models M4 and M5. The difference between these last two models is only slight: the average BS value of the first model was 2.01% greater than that of the second, and the PCCC of the second model was 0.50% greater than that of the first. The R code to reproduce these results is given in Appendix 3.

Table 7.3 Brier score (BS) and proportion of cases correctly classified (PCCC) across 10 random partitions, with 80% of the total data set used for training and the rest for testing, under models (7.3), (7.4), and (7.5) with a flat prior for environment effects

7.3 Ordinal Logistic Regression

As described at the beginning of this chapter, the ordinal logistic model is given by model (7.1) with F the cumulative logistic distribution function. Again, as in the ordinal probit regression model, the posterior distribution of the parameters is not easy to simulate and numerical methods are required. Here we describe the Gibbs sampler proposed by Montesinos-López et al. (2015b), which, in addition to the latent variable Li in the representation of model (7.1), augments the parameter vector with Pólya-Gamma latent random variables.

By using the following identity proposed by Polson et al. (2013):

$$ \frac{{\left[\exp \left(\eta \right)\right]}^a}{{\left[1+\exp \left(\eta \right)\right]}^b}={2}^{-b}\exp \left( k\eta \right){\int}_0^{\infty}\exp \left(-\frac{\eta^2}{2}\omega \right){f}_{\upomega}\left(\omega; b,0\right) d\omega, $$

where k = a − b/2 and fω(ω; b, 0) is the density of a Pólya-Gamma random variable ω with parameters b and d = 0 (ω ∼ PG(b, d); see Polson et al. (2013) for details), the joint distribution of the vector of observations, Y = (Y1, …, Yn)T, and the latent variables L = (L1, …, Ln)T can be expressed as

$$ {\displaystyle \begin{array}{c}f\left(\boldsymbol{y},\boldsymbol{l}|\boldsymbol{\beta}, \boldsymbol{\gamma} \right)=\prod \limits_{i=1}^n{f}_{L_i}\left({l}_i\right){I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right)\\ {}=\prod \limits_{i=1}^n\frac{\exp \left(-{l}_i-{\eta}_i\right)}{{\left[1+\exp \left(-{l}_i-{\eta}_i\right)\right]}^2}{I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right)\\ {}\propto \prod \limits_{i=1}^n\left[{\int}_0^{\infty}\exp \left(-\frac{{\left({l}_i+{\eta}_i\right)}^2}{2}{\omega}_i\right){f}_{\omega_i}\left({\omega}_i;2,0\right)d{\omega}_i\right]{I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right),\end{array}} $$

where \( {\eta}_i={\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \) and ω1, …, ωn are independent random variables with the same Pólya-Gamma distribution with parameters b = 2 and d = 0.

Now, the Bayesian specification developed in Montesinos-López et al. (2015b) assumes the same priors as for the ordinal probit model (BRR prior model), except that now for the threshold parameter vector γ, the prior distribution proposed by Sorensen et al. (1995) is adopted, which is the distribution of the order statistics of a random sample of size C − 1 of the uniform distribution in (γL, γU), that is,

$$ f\left(\boldsymbol{\gamma} \right)=\left(C-1\right)!{\left(\frac{1}{\gamma_U-{\gamma}_L}\right)}^{C-1}{I}_{\left\{\boldsymbol{\gamma} \in {\boldsymbol{S}}_{\gamma}\right\}}, $$

where Sγ = {(γ1, …, γC − 1) : γL < γ1 < γ2 < ⋯ < γC − 1 < γU}.

Then, by conditioning on the rest of parameters, including the latent Pólya-Gamma random variables ω = (ω1, …, ωn)T, the full conditional of γc is given by

$$ {\displaystyle \begin{array}{c}f\left({\gamma}_c|-\right)\propto f\left(\boldsymbol{y},\boldsymbol{l},\boldsymbol{\omega} |\boldsymbol{\beta}, \boldsymbol{\gamma} \right)f\left(\boldsymbol{\gamma} \right)\\ {}\propto \left[\prod \limits_{i=1}^n{I}_{\left({\gamma}_{y_i-1},{\gamma}_{y_i}\right)}\left({l}_i\right)\right]\left(C-1\right)!{\left(\frac{1}{\gamma_U-{\gamma}_L}\right)}^{C-1}{I}_{{\boldsymbol{S}}_{\gamma }}\left(\boldsymbol{\gamma} \right)\\ {}\propto \left[\prod \limits_{i\in \left\{i:{y}_i=c\right\}}{I}_{\left({\gamma}_{c-1},{\gamma}_c\right)}\left({l}_i\right)\prod \limits_{i\in \left\{i:{y}_i=c+1\right\}}{I}_{\left({\gamma}_c,{\gamma}_{c+1}\right)}\left({l}_i\right)\right]\left(C-1\right)!{\left(\frac{1}{\gamma_U-{\gamma}_L}\right)}^{C-1}{I}_{{\boldsymbol{S}}_{\gamma }}\left(\boldsymbol{\gamma} \right)\\ {}\propto {I}_{\left({a}_c,{b}_c\right)}\left({\gamma}_c\right)\end{array}} $$

where f(y, l, ω| β, γ) is the joint density of the observations and the latent random variables L and ω, \( {a}_c=\max \left\{{\gamma}_{c-1},\max \left\{{l}_i:{y}_i=c\right\}\right\} \), and \( {b}_c=\min \left\{{\gamma}_{c+1},\min \left\{{l}_i:{y}_i=c+1\right\}\right\} \). So the full conditional for the threshold parameters is also uniform, γc ∣  −  ∼ U(ac, bc), c = 1, 2, …, C − 1.

Now, the conditional distribution of βj (jth element of β) is given by

$$ {\displaystyle \begin{array}{c}f\left({\beta}_j|{\sigma}_{\beta}^2,\boldsymbol{\gamma}, \boldsymbol{L},\boldsymbol{\omega} \right)\propto f\left(\boldsymbol{y},\boldsymbol{l},\boldsymbol{\omega} |\boldsymbol{\beta}, \boldsymbol{\gamma} \right)f\left(\boldsymbol{\beta} |{\sigma}_{\beta}^2\right)\\ {}\propto \exp \left[-\frac{1}{2}\sum \limits_{i=1}^n{\omega}_i{\left({l}_i+{\eta}_i\right)}^2-\frac{1}{2{\sigma}_{\beta}^2}{\beta}_j^2\right]\\ {}\propto \exp \left[-\frac{1}{2}\sum \limits_{i=1}^n{\omega}_i{\left({e}_{ij}+{x}_{ij}{\beta}_j\right)}^2-\frac{1}{2{\sigma}_{\beta}^2}{\beta}_j^2\right]\\ {}\propto \exp \left\{-\frac{1}{2{\tilde{\sigma}}_{\beta_j}^2}{\left({\beta}_j-{\tilde{\beta}}_j\right)}^2\right\},\end{array}} $$

where \( {e}_{ij}={l}_i+{\sum}_{\genfrac{}{}{0pt}{}{k=1}{k\ne j}}^p{x}_{ik}{\beta}_k \), ej = [e1j, …, enj]T, xj = [x1j, …, xnj]T, \( {\overset{\sim }{\sigma}}_{\beta_j}^2={\left({\sigma}_{\beta}^{-2}+{\sum}_{i=1}^n{\omega}_i{x}_{ij}^2\right)}^{-1} \), and \( {\overset{\sim }{\beta}}_j=-{\overset{\sim }{\sigma}}_{\beta_j}^2\left({\sum}_{i=1}^n{\omega}_i{x}_{ij}{e}_{ij}\right) \). From this, the full conditional distribution of βj is a normal distribution with mean \( {\overset{\sim }{\beta}}_j \) and variance \( {\overset{\sim }{\sigma}}_{\beta_j}^2. \)

The full conditional distribution of \( {\sigma}_{\beta}^2 \) is the same as the one obtained for the ordinal probit model, \( {\sigma}_{\beta}^2\mid -\sim {\chi}^{-2}\left({\overset{\sim }{v}}_{\beta },{\overset{\sim }{S}}_{\beta}\right) \) with \( {\overset{\sim }{v}}_{\beta }={v}_{\beta }+p \) and \( {\overset{\sim }{S}}_{\beta }={S}_{\beta }+{\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{\beta} . \) In a similar fashion, for each latent variable Li, i = 1, …, n, it can be seen that its full conditional distribution is also truncated normal in \( \left({\gamma}_{y_i-1},{\gamma}_{y_i}\right) \), but with mean parameter \( -{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta} \) and variance 1/ωi, i.e., \( {L}_i\mid -\sim N\left(-{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta}, {\omega}_i^{-1}\right) \) truncated in \( \left({\gamma}_{y_i-1},{\gamma}_{y_i}\right) \), for each i = 1, …, n. Finally, by following Eq. (5) in Polson et al. (2013), note that the joint full conditional distribution of the Pólya-Gamma random variables ω can be written as

$$ {\displaystyle \begin{array}{c}f\left(\boldsymbol{\omega} |-\right)\propto f\left(\boldsymbol{y},\boldsymbol{l},\boldsymbol{\omega} |\boldsymbol{\beta}, \boldsymbol{\gamma} \right)\\ {}\propto \prod \limits_{i=1}^n\exp \left(-\frac{{\left({l}_i+{\eta}_i\right)}^2}{2}{\omega}_i\right){f}_{\upomega_{\mathrm{i}}}\left({\omega}_i;2,0\right)\\ {}\propto \prod \limits_{i=1}^n{f}_{\upomega_{\mathrm{i}}}\left({\omega}_i;2,{l}_i+{\eta}_i\right).\end{array}} $$

From here, conditionally on \( \boldsymbol{\beta}, {\sigma}_{\beta}^2,\boldsymbol{\gamma}, \) and L, ω1, …, ωn are independent Pólya-Gamma random variables with parameters b = 2 and d = li + ηi, i = 1, …, n, respectively, that is, ωi ∣  −  ∼ PG(2, li + ηi), i = 1, …, n.

With the above derived full conditionals, a Gibbs sampler exploration of this ordinal logistic regression model can be done with the following steps:

  1. Initialize the parameters: βj = βj0, j = 1, …, p, γc = γc0, c = 1, …, C − 1, li = li0, i = 1, …, n, and \( {\sigma}_{\beta}^2={\sigma}_{\beta 0}^2 \).

  2. Simulate ω1, …, ωn as independent Pólya-Gamma random variables with parameters b = 2 and d = li + ηi, i = 1, …, n.

  3. For each i = 1, …, n, simulate li from the normal distribution \( N\left(-{\boldsymbol{x}}_i^{\mathrm{T}}\boldsymbol{\beta}, {\omega}_i^{-1}\right) \) truncated in \( \left({\gamma}_{y_i-1},{\gamma}_{y_i}\right) \).

  4. For each j = 1, …, p, simulate from the full conditional distribution of βj, that is, from a \( N\left({\overset{\sim }{\beta}}_j,{\overset{\sim }{\sigma}}_{\beta_j}^2\right), \) where \( {\overset{\sim }{\sigma}}_{\beta_j}^2={\left({\sigma}_{\beta}^{-2}+{\sum}_{i=1}^n{\omega}_i{x}_{ij}^2\right)}^{-1} \) and \( {\overset{\sim }{\beta}}_j=-{\overset{\sim }{\sigma}}_{\beta_j}^2\left({\sum}_{i=1}^n{\omega}_i{x}_{ij}{e}_{ij}\right) \).

  5. For each c = 1, 2, …, C − 1, simulate from the full conditional distribution of γc, γc ∣  −  ∼ U(ac, bc), where \( {a}_c=\max \left\{{\gamma}_{c-1},\max \left\{{l}_i:{y}_i=c\right\}\right\} \) and \( {b}_c=\min \left\{{\gamma}_{c+1},\min \left\{{l}_i:{y}_i=c+1\right\}\right\} \).

  6. Simulate \( {\sigma}_{\beta}^2 \) from a scaled inverse chi-squared distribution with parameters \( {\overset{\sim }{v}}_{\beta }={v}_{\beta }+p \) and \( {\overset{\sim }{S}}_{\beta }={S}_{\beta }+{\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{\beta} \), that is, from a \( {\chi}^{-2}\left({\overset{\sim }{v}}_{\beta },{\overset{\sim }{S}}_{\beta}\right). \)

  7. Repeat steps 2–6 until the burn-in period and the desired number of samples are reached (an R fragment illustrating the steps that differ from the probit sampler is given below).
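Relative to the probit sampler sketched earlier in this chapter, only steps 2–4 change. The following R fragment illustrates those steps with hypothetical starting values, assuming that a Pólya-Gamma sampler such as the pgdraw function of the pgdraw package is available:

library(pgdraw)                      # assumed Polya-Gamma sampler: pgdraw(b, c) draws from PG(b, c)
n <- 100; p <- 20; C <- 4
X <- matrix(rnorm(n*p), n, p)        # hypothetical marker matrix
y <- sample(1:C, n, replace = TRUE)  # hypothetical ordinal response
beta <- rep(0, p); sigma2 <- 1
gamma <- c(-Inf, qnorm((1:(C - 1))/C), Inf)
eta <- drop(X %*% beta); l <- rep(0, n)
# Step 2: omega_i | - ~ PG(2, l_i + eta_i)
omega <- pgdraw(2, l + eta)
# Step 3: l_i | - ~ N(-eta_i, 1/omega_i) truncated to (gamma_{y_i - 1}, gamma_{y_i})
sd_i <- 1/sqrt(omega)
u <- runif(n, pnorm(gamma[y], -eta, sd_i), pnorm(gamma[y + 1], -eta, sd_i))
l <- qnorm(u, -eta, sd_i)
# Step 4: beta_j | - ~ N(beta_j_tilde, sigma2_j_tilde) with omega-weighted sums
for (j in 1:p) {
  e_j  <- l + eta - X[, j]*beta[j]
  s2j  <- 1/(1/sigma2 + sum(omega*X[, j]^2))
  bj   <- -s2j*sum(omega*X[, j]*e_j)
  bnew <- rnorm(1, bj, sqrt(s2j))
  eta  <- eta + X[, j]*(bnew - beta[j])
  beta[j] <- bnew
}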

Similar modifications can be done to obtain Gibbs samplers corresponding to other prior adopted models for the beta coefficients (FIXED, BayesA, BayesB, BayesC, or BL; see Chap. 6 for details of these priors). Also, for the ordinal logistic GBLUP Bayesian regression model specification as done in (7.2) for the ordinal probit model, a Gibbs sampler implementation can be done following the steps below, which can be obtained directly from the ones described above:

  1. Initialize the parameters: bi = bi0, i = 1, …, n, γc = γc0, c = 1, …, C − 1, li = li0, i = 1, …, n, and \( {\sigma}_g^2={\sigma}_{g0}^2 \).

  2. Simulate ω1, …, ωn as independent Pólya-Gamma random variables with parameters b = 2 and d = li + bi, i = 1, …, n.

  3. For each i = 1, …, n, simulate li from the normal distribution \( N\left(-{b}_i,{\omega}_i^{-1}\right) \) truncated in \( \left({\gamma}_{y_i-1},{\gamma}_{y_i}\right) \).

  4. Simulate b from \( {N}_n\left(\overset{\sim }{\boldsymbol{b}},{\overset{\sim }{\boldsymbol{\Sigma}}}_b\right) \), with \( {\overset{\sim }{\boldsymbol{\Sigma}}}_b={\left({\sigma}_g^{-2}{\boldsymbol{G}}^{-1}+{\boldsymbol{D}}_{\omega}\right)}^{-1} \) and \( \overset{\sim }{\boldsymbol{b}}=-{\overset{\sim }{\boldsymbol{\Sigma}}}_b{\boldsymbol{D}}_{\omega}\boldsymbol{l} \), where Dω = Diag(ω1, …, ωn).

  5. For each c = 1, 2, …, C − 1, simulate from the full conditional distribution of γc, γc ∣  −  ∼ U(ac, bc), where \( {a}_c=\max \left\{{\gamma}_{c-1},\max \left\{{l}_i:{y}_i=c\right\}\right\} \) and \( {b}_c=\min \left\{{\gamma}_{c+1},\min \left\{{l}_i:{y}_i=c+1\right\}\right\} \).

  6. Simulate \( {\sigma}_g^2 \) from a scaled inverse chi-squared distribution with parameters \( \tilde{v}_{g}={v}_g+n \) and \( {\overset{\sim }{S}}_g={S}_g+{\boldsymbol{b}}^{\mathrm{T}}{\boldsymbol{G}}^{-1}\boldsymbol{b} \), that is, from a \( {\chi}^{-2}\left(\tilde{v}_{g},{\overset{\sim }{S}}_g\right). \)

  7. Repeat steps 2–6 until the burn-in period and the desired number of samples are reached.

There is considerable empirical evidence that there are no large differences between the prediction performance of the probit and logistic regression models. For this reason, here we only described the Gibbs sampler for ordinal data under a logistic framework and did not provide illustrative examples; moreover, this logistic ordinal model cannot be implemented in BGLR.

7.4 Penalized Multinomial Logistic Regression

An extension of the logistic regression model described in Chap. 3 is the multinomial regression model, which is also used to explain or predict a categorical nominal response variable (one without any natural order) with more than two categories. For example, a study could investigate the association of markers with diabesity (diabetes and obesity, diabetes but no obesity, obesity but no diabetes, and no diabetes and no obesity); another example could be to study the effects of age group on the histological subtypes of cancer in women (adenocarcinoma, adenosquamous, others), or to predict the preference in an election among four candidates (C1, C2, C3, and C4) using socioeconomic and demographic variables, etc.

The multinomial logistic regression model assumes that, conditionally on a covariate vector xi, a multinomial response random variable Yi takes one of the categories 1, 2, …, C with the following probabilities:

$$ P\left({Y}_i=c|{\boldsymbol{x}}_i\right)=\frac{\exp \left({\beta}_{0c}+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_c\right)}{\sum_{l=1}^C\exp \left({\beta}_{0l}+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_l\right)},c=1,\dots, C, $$
(7.6)

where βc, c = 1, …, C, is a vector of coefficients of the same dimension as x.

Model (7.6) is not identifiable because evaluating these probabilities at the parameter values \( \left({\beta}_{0c}^{\ast },{\boldsymbol{\beta}}_c^{\ast \mathrm{T}}\right)=\left({\beta}_{0c}+{\beta}_{0}^{\ast \ast },{\boldsymbol{\beta}}_{c}^{\mathrm{T}}+{\boldsymbol{\beta}}^{\ast \ast \mathrm{T}}\right) \), c = 1, …, C, where \( {\beta}_{0}^{\ast \ast } \) is an arbitrary constant and \( {\boldsymbol{\beta}}^{\ast \ast } \) an arbitrary vector, gives the same probabilities as computing them with the original parameter values \( \left({\beta}_{0c},{\boldsymbol{\beta}}_c^{\mathrm{T}}\right) \), c = 1, …, C. A common constraint that avoids this lack of identifiability is to set \( \left({\beta}_{0C},{\boldsymbol{\beta}}_C^{\mathrm{T}}\right)=\left(0,{\mathbf{0}}^{\mathrm{T}}\right) \), although any one of the other C − 1 parameter vectors could be chosen instead.

With such a constraint (\( \left({\beta}_{0C},{\boldsymbol{\beta}}_C^{\mathrm{T}}\right)=\left(0,{\mathbf{0}}^{\mathrm{T}}\right) \)), we can identify a model by assuming that the effects of the covariate vector xi over the log “odds” of each of the categories c = 1, …, C − 1 with respect to category C (baseline category) are given (Agresti 2012) by

$$ \log \left(\frac{P\left({Y}_i=c|x\right)}{P\left({Y}_i=C|x\right)}\right)={\beta}_{0c}+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_c,\kern0.5em c=1,\dots, C-1, $$

where the effects of xi depend on the chosen response baseline category. Similar expressions can be obtained when using the unconstrained model in (7.6). From these relationships, the log odds corresponding to the probabilities of any two categories, c and l, can be derived:

$$ \log \left(\frac{P\left({Y}_i=c|x\right)}{P\left({Y}_i=l|x\right)}\right)={\beta}_{0c}-{\beta}_{0l}+{\boldsymbol{x}}_i^{\mathrm{T}}\left({\boldsymbol{\beta}}_c-{\boldsymbol{\beta}}_l\right),\kern0.75em c=1,\dots, C-1. $$

Sometimes the number of covariates is larger than the number of observations (for example, in expression arrays and genomic prediction where the number of markers is often larger than the number of phenotyped individuals), so the conventional constraint described above to force identifiability in the model is not enough and some identifiability problems remain. One way to avoid this problem is to use similar quadratic regularization maximum likelihood estimation methods (Zhu and Hastie 2004) as done for some models in Chap. 3, but here only on the slopes (βc) for the covariates, under which setting the constraints mentioned before are no longer necessary.

For a given value of the regularization parameter λ > 0, the regularized maximum likelihood estimate of the beta coefficients, \( \boldsymbol{\beta} ={\left({\beta}_{01},{\boldsymbol{\beta}}_1^{\mathrm{T}},{\beta}_{02},{\boldsymbol{\beta}}_2^{\mathrm{T}},\dots, {\beta}_{0C},{\boldsymbol{\beta}}_C^{\mathrm{T}}\right)}^{\mathrm{T}} \), is the value that maximizes the penalized log-likelihood:

$$ {\mathrm{\ell}}_p\left(\boldsymbol{\beta}; \boldsymbol{y}\right)=\mathrm{\ell}\left(\boldsymbol{\beta}; \boldsymbol{y}\right)-\lambda \sum \limits_{c=1}^C{\boldsymbol{\beta}}_c^{\mathrm{T}}{\boldsymbol{\beta}}_c, $$
(7.7)

where \( \mathrm{\ell}\left(\boldsymbol{\beta}; \boldsymbol{y}\right)=\log \left[L\left(\boldsymbol{\beta}; \boldsymbol{y}\right)\right] \) is the logarithm of the likelihood and takes the form:

$$ {\displaystyle \begin{array}{c}\mathrm{\ell}\left(\boldsymbol{\beta}; \boldsymbol{y}\right)=\sum \limits_{i=1}^n\log \left[P\left({Y}_i={y}_i|{\boldsymbol{x}}_i\right)\right]=\sum \limits_{i=1}^n\sum \limits_{c=1}^C{I}_{\left\{{y}_i=c\right\}}\log \left[\frac{\exp \left({\beta}_{0c}+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_c\right)}{\sum_{l=1}^C\exp \left({\beta}_{0l}+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_l\right)}\right]\\ {}=\sum \limits_{i=1}^n\sum \limits_{c=1}^C{I}_{\left\{{y}_i=c\right\}}\left({\beta}_{0c}+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_c\right)-\sum \limits_{i=1}^n\log \left[\sum \limits_{l=1}^C\exp \left({\beta}_{0l}+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_l\right)\right]\end{array}} $$
(7.8)
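A direct R transcription of the penalized log-likelihood in (7.7)–(7.8) with the Ridge penalty may help to fix ideas; the function and argument names below are ours, with B0 containing the C intercepts and the columns of B containing β1, …, βC:

# Penalized multinomial log-likelihood (7.7)-(7.8) with Ridge penalty
# B0: vector of C intercepts; B: p x C matrix whose columns are beta_1, ..., beta_C
# X: n x p matrix of covariates; y: response coded 1, ..., C; lambda: regularization parameter
penalized_loglik <- function(B0, B, X, y, lambda) {
  Eta <- sweep(X %*% B, 2, B0, "+")                  # n x C matrix with entries beta_0c + x_i' beta_c
  loglik <- sum(Eta[cbind(seq_along(y), y)]) -       # first term of (7.8)
            sum(log(rowSums(exp(Eta))))              # second term of (7.8)
  loglik - lambda*sum(B^2)                           # Ridge penalty; intercepts are not penalized
}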

When p is large (p ≫ n), direct optimization of \( {\mathrm{\ell}}_p\left(\boldsymbol{\beta}; \boldsymbol{y}\right) \) is almost impossible. An alternative is to use the sequential minimization optimization algorithm proposed by Zhu and Hastie (2004), which is applied after a transformation trick is used to make the involved computations feasible, because the number of parameters in the optimization is reduced to only (n + 1)C instead of (p + 1)C.

Another alternative available in the glmnet package is the one proposed by Friedman et al. (2010) that is similar to that of the logistic Ridge regression in Chap. 3. This consists of maximizing (7.7) by using a block-coordinate descent strategy, where each block is formed by the beta coefficients corresponding to each class, \( {\boldsymbol{\beta}}_c^{\ast \mathrm{T}}=\left({\beta}_{0c},{\boldsymbol{\beta}}_c^{\mathrm{T}}\right) \), but with \( \mathrm{\ell}\left(\boldsymbol{\beta}; \boldsymbol{y}\right) \) replaced by a quadratic approximation with respect to the beta coefficients of the chosen block, \( \left({\beta}_{0c},{\boldsymbol{\beta}}_c^{\mathrm{T}}\right) \), at the current values of β (\( \overset{\sim }{\boldsymbol{\beta}} \)). That is, the update of block c is achieved by maximizing the following function with respect to β0c and βc:

$$ {f}_c\left({\beta}_{0c},{\boldsymbol{\beta}}_c\right)={\ell}_c^{\ast}\left(\boldsymbol{\beta}; \boldsymbol{y}\right)-\lambda {\boldsymbol{\beta}}_c^{\mathrm{T}}{\boldsymbol{\beta}}_c, $$
(7.9)

where \( {\ell}_c^{\ast}\left(\boldsymbol{\beta}; \boldsymbol{y}\right)=-\frac{1}{2}{\sum}_{i=1}^n{w}_{ic}{\left({y}_{ic}^{\ast }-{\beta}_{0c}-{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_c\right)}^2+\overset{\sim }{c} \) is the second-order Taylor approximation of \( \mathrm{\ell}\left(\boldsymbol{\beta}; \boldsymbol{y}\right) \) with respect to the beta coefficients that conform block c, \( \left({\beta}_{0c},{\boldsymbol{\beta}}_c^{\mathrm{T}}\right) \), about the current estimates \( \overset{\sim }{\boldsymbol{\beta}}={\left({\overset{\sim }{\beta}}_{01},{\overset{\sim }{\boldsymbol{\beta}}}_1^{\mathrm{T}},{\overset{\sim }{\beta}}_{02},{\overset{\sim }{\boldsymbol{\beta}}}_2^{\mathrm{T}},\dots, {\overset{\sim }{\beta}}_{0C},{\overset{\sim }{\boldsymbol{\beta}}}_C^{\mathrm{T}}\right)}^{\mathrm{T}} \); \( {y}_{ic}^{\ast }={\overset{\sim }{\beta}}_{0c}+{\boldsymbol{x}}_i^{\mathrm{T}}{\overset{\sim }{\boldsymbol{\beta}}}_c+{w}_{ic}^{-1}\left({I}_{\left\{{y}_i=c\right\}}-\tilde{p}_{c}\left({\boldsymbol{x}}_i\right)\right) \) is the “working response,” \( {w}_{ic}=\tilde{p}_{c}\left({\boldsymbol{x}}_i\right)\left(1-\tilde{p}_{c}\left({\boldsymbol{x}}_i\right)\right) \), and \( \tilde{p}_{c}\left({\boldsymbol{x}}_i\right) \) is P(Yi = c| xi) given in (7.6) but evaluated at the current parameter values.

When X is the design matrix with standardized independent variables, the updated parameters of block c, \( {\hat{\boldsymbol{\beta}}}_c^{\ast }, \) can be obtained with the following formula:

$$ {\hat{\boldsymbol{\beta}}}_c^{\ast }={\left({\boldsymbol{X}}^{\ast \mathrm{T}}{\boldsymbol{W}}_c{\boldsymbol{X}}^{\ast }+\lambda \boldsymbol{D}\right)}^{-1}{\boldsymbol{X}}^{\ast \mathrm{T}}{\boldsymbol{W}}_c{\boldsymbol{y}}_c^{\ast }, $$

where X∗ = [1n X], Wc = Diag(w1c, …, wnc), \( {\boldsymbol{y}}_c^{\ast}={\left({y}_{1c}^{\ast },\dots, {y}_{nc}^{\ast}\right)}^{\mathrm{T}} \), and D is an identity matrix of dimension (p + 1) × (p + 1) except that its first entry is 0 instead of 1 (so that the intercept is not penalized). However, in the context of p ≫ n, a computationally feasible optimization of (7.9) is achieved by using coordinate descent methods, as done in the glmnet package and commented in Sect. 3.6.2 of Chap. 3.
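For small p, the closed-form block update above can be transcribed directly; the following sketch uses our own names and serves only to illustrate the formula (in practice, glmnet uses coordinate descent instead of this matrix inversion):

# One Ridge block update for class c using the closed-form formula above (small p only)
# Xs: n x (p+1) matrix [1_n, X]; w_c: weights w_ic; y_star_c: working responses; lambda: penalty
ridge_block_update <- function(Xs, w_c, y_star_c, lambda) {
  D <- diag(ncol(Xs)); D[1, 1] <- 0                   # first entry 0: the intercept is not penalized
  XtW <- t(Xs*w_c)                                    # equals t(Xs) %*% diag(w_c)
  solve(XtW %*% Xs + lambda*D, XtW %*% y_star_c)      # (X*' W_c X* + lambda D)^{-1} X*' W_c y*_c
}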

For other penalization terms, a very similar algorithm to the one described before can be used. For example, for Lasso penalty, the penalized likelihood (7.7) is modified as

$$ {\mathrm{\ell}}_p\left(\boldsymbol{\beta}; \boldsymbol{y}\right)=\mathrm{\ell}\left(\boldsymbol{\beta}; \boldsymbol{y}\right)-\lambda \sum \limits_{c=1}^C\sum \limits_{j=1}^p\left|{\beta}_{cj}\right| $$
(7.10)

and block updating can be done as in (7.9), except that now \( {\boldsymbol{\beta}}_c^{\mathrm{T}}{\boldsymbol{\beta}}_c \) is replaced by \( {\sum}_{j=1}^p\left|{\beta}_{cj}\right| \). As with the penalized logistic regression studied in Chap. 3, the most common approach for choosing the “optimal” regularization parameter λ in the penalized multinomial regression model (7.7) is a k-fold cross-validation strategy with the misclassification error as the metric, and this is what is used here. For more details, see Friedman et al. (2010). It is important to point out that the tuning parameter λ used in glmnet is equal to the one used in the penalized log-likelihoods (7.7) and (7.10) divided by the number of observations.

7.4.1 Illustrative Examples for Multinomial Penalized Logistic Regression

Example 4

To illustrate the penalized multinomial regression model, here we considered the ordinal data of Example 1, but treating the data as nominal, which under the prediction paradigm may not be so important. To evaluate the performance of this model with Ridge (7.9) and Lasso (7.10) penalization, the same 10 random partitions used in Example 1 were used in the cross-validation strategy. By combining these two penalizations and including different covariate information, the performance of six resulting models was evaluated: the penalized Ridge multinomial logistic regression model (PRMLRM) with markers as predictors (PRMLRM-1); PRMLRM with environments and markers as predictors (PRMLRM-2); PRMLRM with environments, markers, and environment×marker interaction as predictors (PRMLRM-3); and the penalized Lasso multinomial logistic regression models, PLMLRM-1, PLMLRM-2, and PLMLRM-3, which have the same predictors as PRMLRM-1, PRMLRM-2, and PRMLRM-3, respectively.

The results are shown in Table 7.4, where again the BS and PCCC metrics were computed for each partition. We can observe a similar performance of all models with both metrics. The largest difference in the average BS values across the 10 partitions was found between the PRMLRM-3 (worst) model and the PLMLRM-1 and PLMLRM-2 (best) models, the average BS value of the former being 1.34% greater than that of the latter two. With respect to the other metric, the best average PCCC value (0.29) was obtained with model PRMLRM-2, a 2.36% better performance than the worst average PCCC value (0.28), obtained with models PRMLRM-1-2.

Table 7.4 Brier score (BS) and proportion of cases correctly classified (PCCC) across 10 random partitions, with 80% of the total data set used for training and the rest for testing, under model (7.6), with Ridge and Lasso penalization and different covariate information: the penalized Ridge multinomial logistic regression (PRMLR) model with markers as covariates (PRMLRM-1); the PRMLR model with environment and markers as covariates (PRMLRM-2); the PRMLR model with environment, markers, and environment×marker interaction as covariates (PRMLRM-3); PLMLRM-1, PLMLRM-2, and PLMLRM-3 denote the penalized Lasso multinomial logistic regression models with the same covariates as PRMLRM-1, PRMLRM-2, and PRMLRM-3, respectively

All these models gave a better performance than all the models fitted in Example 1, but only a slightly better performance than the models considered in Example 2, which, besides the marker information (the only input included in the models of Example 1 and in PRMLRM-1 and PLMLRM-1), also included environment and environment×marker interaction effects as predictors. Specifically, even the simpler models in Table 7.4 that use only marker information in the predictor (PRMLRM-1 and PLMLRM-1) gave results comparable to the more complex and competitive models fitted in Example 2. The relative difference in the average BS value between the best models of these two sets (models in Example 2 vs. models in Table 7.4) was 1.25%, while the relative difference in the average PCCC was 3.57%. The R code to reproduce the results of the models under the Ridge penalization is given in Appendix 4. The code for the Lasso penalization models is the same as for the Ridge penalization but with the alpha value set to 1 instead of 0 in the basic glmnet call used to fit the penalized Ridge multinomial logistic regression models:

A = cv.glmnet(X, y, family='multinomial', type.measure = "class", nfolds = 10, alpha = 0)

where X and y are the design matrix and the vector of the response variable used for training, and the multinomial model is specified with family='multinomial'. With type.measure = "class" we specify the metric used by the inner cross-validation (CV) strategy to internally select the “optimal” value of the regularization parameter λ (over a grid of 100 values of λ, by default); this misclassification error is equal to 1 − PCCC. nfolds is the number of folds used in the inner CV strategy; when this value is not specified, it is set to 10 by default. alpha is the mixing parameter of the more general Elastic Net penalty: when it is set to 0, the Ridge penalty is obtained; when it is set to 1, the Lasso penalty is used; and when alpha takes values between 0 and 1, the Elastic Net penalization is implemented.
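As a minimal sketch of how the two metrics reported in this chapter can be computed from the fitted object A (here X_tst and y_tst are illustrative names for the testing design matrix and testing response, not objects created by glmnet):

probs = predict(A, newx = X_tst, s = "lambda.min", type = "response")[, , 1]  # matrix of predicted class probabilities
y_hat = colnames(probs)[max.col(probs)]                                       # predicted class = highest probability
PCCC = mean(y_hat == y_tst)                                                   # proportion of cases correctly classified
Y_obs = model.matrix(~0 + factor(y_tst, levels = colnames(probs)))            # one-hot coding of the observed classes
BS = mean(rowSums((probs - Y_obs)^2))                                         # Brier score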

A kind of GBLUP implementation of the simpler models (PRMLRM-1 and PLMLRM-1) considered in Example 4 can be obtained by computing the genomic relationship matrix from the marker information and using as predictor the lower triangular factor of its Cholesky decomposition (or its square root matrix). Something similar can be done to include the Env×Line interaction term. These kinds of “GBLUP” implementations of PRMLRM-1 and PLMLRM-1 are referred to as GMLRM-R1 and GMLRM-L1, respectively. In both cases, the input matrix is X = ZLLg, where ZL is the incidence matrix of the lines and Lg is the lower triangular factor of the Cholesky decomposition of G, \( \boldsymbol{G}={\boldsymbol{L}}_g{\boldsymbol{L}}_g^{\mathrm{T}} \). For the rest of the models, this kind of “GBLUP” implementation will be referred to as GMLRM-R2 and GMLRM-R3 for the Ridge penalty, and as GMLRM-L2 and GMLRM-L3 for the Lasso penalty. In these cases, the input matrix is X = [XE, ZLLg] when environment and genetic effects are included in the predictor (GMLRM-R2, GMLRM-L2), and X = [XE, ZLLg, ZEL(II ⊗ Lg)] when the genotype×environment interaction effect is also taken into account (GMLRM-R3, GMLRM-L3), where II is the identity matrix of order I, I is the number of environments, and ZEL is the incidence matrix of the environment×line combinations.

The basic code for the glmnet implementation of the GMLRM-L3 model is given below; the other five kinds of “GBLUP” implementations described above can be fitted by simply removing the corresponding predictors and/or setting the alpha value to 0 for the Ridge penalty:

Lg = t(chol(G))
#Environment matrix design
XE = model.matrix(~Env, data = dat_F)
#Matrix design of the genetic effect
ZL = model.matrix(~0 + GID, data = dat_F)
ZLa = ZL %*% Lg
#Environment-genetic interaction
ZEL = model.matrix(~0 + GID:Env, data = dat_F)
ZELa = ZEL %*% kronecker(diag(dim(XE)[2]), Lg)
X = cbind(XE[,-1], ZLa, ZELa) #Input X matrix
A = cv.glmnet(X, y, family = 'multinomial', type.measure = "class", alpha = 1) #alpha=0 for Ridge penalty

Example 5

Here we considered the data in Example 4 to illustrate the implementation of the following six kinds of “GBLUP” models: GMLRM-R1, GMLRM-R2, GMLRM-R3, GMLRM-L1, GMLRM-L2, and GMLRM-L3. To evaluate the performance of these models, the same CV strategy was used, where for each of the 10 random partitions, 80% of the full data set was taken to train the models and the rest to evaluate their performance.

The results presented in Table 7.5 are comparable to those given in Table 7.4 for Example 4. The relative difference between the model in Table 7.4 with the best average BS value (PLMLRM-1) and the model in Table 7.5 with the best average BS value (GMLRM-R1) is 0.6990%, in favor of the latter model. Regarding the average PCCC, the “kind” of GBLUP model with the best value, GMLRM-L2, was 2.2989% greater than the average PCCC value of the best models in Table 7.4, PRMLRM-2 and PLMLRM-3. Furthermore, among the models in Table 7.5, the greatest difference between the average BS values was observed between models GMLRM-R1 (best) and GMLRM-R3 (worst), a relative difference of 2.748%. The best average PCCC performance was observed with model GMLRM-L2 and the worst with model GMLRM-L3, a relative difference of 7.23% between the best and the worst.

Table 7.5 Brier score (BS) and proportion of cases correctly classified (PCCC) across 10 random partitions, with 80% of the data used for training and the rest for testing, for the multinomial model (7.6) under the six types of “GBLUP” implementation of the Ridge and Lasso penalizations described in Example 4, obtained by varying the input information: GMLRM-R1 (X = ZLLg), GMLRM-R2 (X = [XE, ZLLg]), GMLRM-R3 (X = [XE, ZLLg, ZEL(II ⊗ Lg)]), GMLRM-L1 (X = ZLLg), GMLRM-L2 (X = [XE, ZLLg]), and GMLRM-L3 (X = [XE, ZLLg, ZEL(II ⊗ Lg)]), where the first three use the Ridge penalty and the last three the Lasso penalty

The R code to reproduce the results in Table 7.5 for the first three models is given in Appendix 5. By only changing the value of alpha to 1, the same code allows implementing the last three models with the Lasso penalty.

7.5 Penalized Poisson Regression

In contrast to ordinal data, sometimes the response has no known upper bound; rather, it is a count or frequency, for example, the number of customers per minute in a specific supermarket, the number of infected spikelets per plant, the number of defects per unit, the number of seeds per plant, etc. The most often used model to relate a set of explanatory variables to a count response variable is Poisson regression.

Given the vector of covariates xi = (xi1, …, xip)T, the Poisson log-linear regression models the number of events Yi as a Poisson random variable with probability mass function

$$ P\left({Y}_i=y|{\boldsymbol{x}}_i\right)=\frac{\lambda_i^y\exp \left(-{\lambda}_i\right)}{y!},y=0,1,2,\dots, $$
(7.11)

where the expected value of Yi is given by \( {\lambda}_i=\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right) \), β0 is the intercept, and \( {\boldsymbol{\beta}}_0={\left({\beta}_1,\dots, {\beta}_p\right)}^{\mathrm{T}} \) is the vector of beta coefficients (effects of the covariates). For example, Yi may be the number of defects of a product under specific production conditions (temperature, raw material, schedule, operator, etc.) or the number of spikelets per plant, which can be predicted from environment and marker information.

For parameter estimation in this model, the most usual method is maximum likelihood estimation, but this is not suitable in the context of a large number of covariates, so here, as for the logistic and multinomial models, we describe the penalized likelihood method. For the Ridge penalty, the penalized log-likelihood function is given by

$$ {\displaystyle \begin{array}{c}{\mathrm{\ell}}_p\left({\beta}_0,{\boldsymbol{\beta}}_0;\boldsymbol{y}\right)=\sum \limits_{i=1}^n\log \left[{f}_{Y_i}\left({y}_i;{\beta}_0,{\boldsymbol{\beta}}_0\right)\right]-\frac{\lambda }{2}\sum \limits_{j=1}^p{\beta}_j^2\\ {}=\sum \limits_{i=1}^n{y}_i\left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)-\sum \limits_{i=1}^n\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)-\sum \limits_{i=1}^n\log \left({y}_i!\right)-\frac{\lambda }{2}\sum \limits_{j=1}^p{\beta}_j^2\end{array}}. $$
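Before turning to the coordinate descent approach, the following toy sketch (simulated data; all object names and values are illustrative) shows that, for a small number of covariates, this Ridge-penalized Poisson log-likelihood can be maximized directly with a general-purpose optimizer such as optim:

set.seed(2)
n = 40; p = 4; lambda = 0.5
X = matrix(rnorm(n * p), n, p)
y = rpois(n, lambda = exp(0.3 + drop(X %*% rep(0.2, p))))  # simulated counts
neg_pen_ll = function(par) {                               # negative penalized log-likelihood
  b0 = par[1]; b = par[-1]
  eta = drop(b0 + X %*% b)
  -(sum(y * eta - exp(eta) - lfactorial(y)) - (lambda / 2) * sum(b^2))
}
fit = optim(rep(0, p + 1), neg_pen_ll, method = "BFGS")
fit$par                                                    # estimated (beta_0, beta_1, ..., beta_p)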

Direct optimization of this kind, as in the toy sketch above, is only practical for a few covariates; in a general genomic prediction context where p ≫ n, other alternatives are more useful. One is the same procedure described above for the multinomial and logistic regressions, in which the log-likelihood is replaced by a second-order approximation at the current beta coefficients and the resulting weighted least-squares problem is then solved by coordinate descent methods. This method for Poisson regression is also implemented in the glmnet package, and the basic code for implementing it is as follows:

A = cv.glmnet(X, y, family='poisson', type.measure = "mse", nfolds = 10, alpha = 0)

where again X and y are the design matrix and the vector of the response variable used for training the Poisson model, which is specified by family='poisson'. The metric used internally by the inner cross-validation (CV) strategy to choose the “optimal” regularization parameter λ (over a grid of 100 values of λ, by default) is specified in type.measure; in addition to the mean squared error ("mse"), the deviance ("deviance") and the mean absolute error ("mae") can also be adopted. The other arguments are the same as described for the logistic and multinomial regression models. To implement the Lasso Poisson regression (Poisson regression with the Lasso penalty), this same code can be used by simply setting alpha = 1 instead of 0. Furthermore, in the genomic context, as done before for other kinds of responses (continuous, binary, nominal), more complex models involving marker (genetic), environment, and/or marker × environment interaction effects can also be implemented with this function.
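A minimal sketch (X_tst and y_tst are illustrative names for the testing data, as before) of how a fitted Poisson model can be evaluated with the mean squared error of prediction, the metric used in Example 6 below, is:

lambda_hat = predict(A, newx = X_tst, s = "lambda.min", type = "response")  # predicted expected counts
MSE = mean((y_tst - lambda_hat)^2)                                          # mean squared error of prediction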

All these predictors, analogous to those described for the multinomial model, are shown in Table 7.6, where the first three (PRPRM-1–3) correspond to Poisson Ridge regression with marker, environment plus marker, and environment plus marker plus environment×marker interaction effects, respectively. The second group of three models (PLPRM-1–3) are the same as before but with the Lasso penalization. The third group of three models (GPRM-R1–R3) are the corresponding GBLUP implementations of the first three predictors, while the last group of three models (GPRM-L1–L3) corresponds to the GBLUP implementations of the three Lasso penalized models.

Table 7.6 Penalized Poisson regression models with Ridge or Lasso penalties and different covariate information: the penalized Ridge Poisson regression (PRPRM) model with markers as covariates (PRPRM-1); the PRPRM model with environment and markers as covariates (PRPRM-2); the PRPRM model with environment, markers, and environment×marker interaction as covariates (PRPRM-3); PLPRM-1, PLPRM-2, and PLPRM-3 denote the corresponding penalized Lasso Poisson regression models. GPRM-R1–R3 are the corresponding GBLUP implementations of the PRPRM models, while GPRM-L1–L3 are the respective GBLUP implementations of the PLPRM models. XE, XM, and XEM are the design matrices of environment, markers, and environment×marker interaction; ZL and ZEL are the incidence matrices of line and environment×line interaction effects; and Lg is the lower triangular factor of the Cholesky decomposition of G, \( \boldsymbol{G}={\boldsymbol{L}}_g{\boldsymbol{L}}_g^T \)
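As a sketch of how one of these GBLUP-type Poisson models (e.g., GPRM-L3) could be fitted, the matrices XE, ZLa, and ZELa constructed in the multinomial code above can be reused (these object names are ours, carried over from that example); the simpler predictors are obtained by dropping the corresponding columns, and alpha = 0 gives the Ridge versions:

X = cbind(XE[,-1], ZLa, ZELa)  # environment, genetic, and environment x genetic effects
A = cv.glmnet(X, y, family = 'poisson', type.measure = "mse", alpha = 1)  # alpha = 0 for Ridge penalty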

Example 6

To illustrate how to fit the Poisson model with the predictors given in Table 7.6 and compare these in terms of the mean squared error of prediction, we considered a small data set that is part of the data set used in Montesinos-López et al. (2016). It consists of 50 lines in three environments and a total of 1635 markers. The performance evaluation was done using 10 random partitions where 80% of the data was used to train the model and the rest to test the model. The glmnet R package was used to train all these regression models.

The results are shown in Table 7.7, where the first six models are the six penalized Poisson regression models given in Table 7.6. Among the Ridge penalized models (PRPRM-1–3), we can observe a slightly better performance of the most complex model, which takes into account environment, marker, and environment×marker interaction effects (PRPRM-3). Among the penalized Lasso Poisson regression models (PLPRM-1–3), the most complex model (PLPRM-3) also performed best, with a more notable difference, and it was also better than the best penalized Ridge model (PRPRM-3).

Table 7.7 Mean square error of prediction (MSE) of six penalized Poisson regression models with Ridge or Lasso penalties and different covariate information (first six models) and MSE of the GBLUP implementation of the corresponding models (second six models), across 10 random partitions, with 80% of the total data set used for training and the rest for testing. See Table 7.6 for details of each model

Furthermore, among the GBLUP implementations of the Poisson regression models with the Ridge penalty, the best MSE performance was obtained with the model that includes environment and genetic effects (GPRM-R2), and this was the best among all Ridge penalized models (PRPRM-1–3 and GPRM-R1–R3); the average MSE of the best PRPRM (PRPRM-3) is 7.5668% greater than the average MSE of the best GPRM with the Ridge penalty (GPRM-R2). Under the GBLUP implementations of the Poisson regression models with the Lasso penalty (GPRM-L1–L3), the best prediction performance was also obtained when using environment and genetic effects (GPRM-L2), but its average MSE differed only slightly (by 2.71%) from the average MSE of the best Lasso penalized Poisson regression model (PLPRM-3). Additionally, this GBLUP implementation (GPRM-L2) showed the best average MSE performance among all 12 implemented models, with a notable difference from the second best (GPRM-R2), whose average MSE was 12.01% greater than that of the GPRM-L2 model.

The poor average MSE performance of model GPRM-L3 is due to the high MSE value obtained in partition 2. However, such high variability of the MSE across partitions may not be observed with other, larger data sets.

The R code to reproduce these results is shown in Appendix 6.

7.6 Final Comments

In this chapter, we give the fundamentals of some popular Bayesian and classical regression models for categorical and count data, and we provide many practical examples of how to implement these models for genomic prediction. The examples take into account in the predictor not only the effects of markers or genotypes, but also illustrate how to include the effects of environment, genotype × environment interaction, and marker × environment interaction. Also, for each type of response variable, we evaluated the prediction performance with appropriate metrics, which is very important since the metrics for evaluating prediction performance depend on the type of response variable. It is important to point out that the components to include in the predictor are not restricted to those illustrated here, since the user can include other main effects or interaction effects in a similar fashion to that illustrated in the examples provided.