
3.1 Definition of a Linear Multiple Regression Model

A linear multiple regression model (LMRM) is a useful tool for investigating linear relationships between two or more explanatory variables (called inputs or features in the machine learning literature), X, and the conditional expected value of a response, E(Y∣X). Due to its simplicity, adequate fit, and easily interpretable results, it has been one of the most popular techniques for studying the association between variables. It is also a useful approach and an ideal (natural) starting point for studying more advanced methods of association and prediction (James et al. 2013).

In this chapter, we review the main concepts and approaches for fitting a linear regression model.

3.2 Fitting a Linear Multiple Regression Model via the Ordinary Least Squares (OLS) Method

In a general context, we have a covariate vector X = (X1, …, Xp)T and we want to use this information to predict or explain how these variables affect a real-valued response Y. The linear multiple regression model assumes a relationship given by

$$ Y={\beta}_0+\sum \limits_{j=1}^p{X}_j{\beta}_j+\epsilon, $$
(3.1)

where ϵ is a random error with mean 0, E(ϵ) = 0, that is independent of X. This error is included in the model to capture measurement errors and the effects of other unregistered explanatory variables that can help to explain the mean response. Under this model, the conditional mean is \( E\left(Y|X\right)={\beta}_0+{\sum}_{j=1}^p{X}_j{\beta}_j \) and the conditional distribution of Y given X is affected only by the information in X.

For estimating the parameters β = (β0, β1, …,  βp)T, we usually have a set of data \( \left({\boldsymbol{x}}_{\boldsymbol{i}}^{\mathrm{T}},{y}_i\right) \), i = 1, …, n, often known as training data, where xi = (xi1, …, xip)T is the vector of feature measurements and yi is the response measurement corresponding to the ith individual drawn. The most common method for estimating β is the ordinary least squares (OLS) method, which consists of taking the value of β that minimizes the residual sum of squares defined as

$$ \mathrm{RSS}\left(\boldsymbol{\beta} \right)=\sum \limits_{i=1}^n{\left({y}_i-{\beta}_0-{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)}^2={\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right)}^{\mathrm{T}}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right), $$

where β0 = ( β1, …, βp)T, y = (y1, …, yn)T is the vector with the response values of all individuals, and X is an n × (p + 1) matrix that contains the information of the measured features of all individuals, including the intercept in the first entry:

$$ \boldsymbol{X}=\left[\begin{array}{cccc}1& {x}_{11}& \cdots & {x}_{1p}\\ {}\vdots & \vdots & \vdots & \vdots \\ {}1& {x}_{n1}& \cdots & {x}_{np}\end{array}\right]. $$

If the X matrix has full column rank, then by differentiating the residual sum of squares with respect to the β coefficients, we can find the set of β parameters that minimize the RSS(β),

$$ \frac{\partial \mathrm{RSS}\left(\boldsymbol{\beta} \right)}{\partial \boldsymbol{\beta}}=\frac{\partial {\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right)}^{\mathrm{T}}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right)}{\partial \boldsymbol{\beta}}=\frac{\partial \left[{\boldsymbol{y}}^{\mathrm{T}}\boldsymbol{y}-\mathbf{2}{\boldsymbol{y}}^{\mathrm{T}}\boldsymbol{X}\boldsymbol{\beta } +{\boldsymbol{\beta}}^{\mathrm{T}}\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)\boldsymbol{\beta}\right]}{\partial \boldsymbol{\beta}}=\mathbf{2}\left[\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)\boldsymbol{\beta} -{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{Y}\right] $$

This derivative is also known as the gradient of the residual sum of squares. Then by setting the gradient of the residual sum of squares to zero, we obtain the normal equations

$$ \left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)\boldsymbol{\beta} ={\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{Y} $$

The solution to the normal equations is unique and gives the OLS estimator of β

$$ \hat{\boldsymbol{\beta}}={\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{y}, $$

where the superscript −1 denotes the matrix inverse.
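To make the closed-form solution concrete, the following R sketch (with simulated data and hypothetical variable names) computes \( \hat{\boldsymbol{\beta}}={\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{y} \) directly from the normal equations and compares it with the estimate returned by the built-in lm() function.

# Illustrative sketch: OLS computed from the normal equations (simulated data)
set.seed(10)
n <- 50; p <- 2
X1 <- matrix(rnorm(n*p), n, p)              # two simulated covariates
y  <- 2 + 1.5*X1[,1] - 0.8*X1[,2] + rnorm(n)
X  <- cbind(1, X1)                          # design matrix with intercept column
beta_hat <- solve(t(X)%*%X, t(X)%*%y)       # (X^T X)^{-1} X^T y
beta_hat
coef(lm(y ~ X1))                            # should agree with beta_hat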

From the above assumptions, we can show that this estimator is unbiased

$$ {\displaystyle \begin{array}{c}E\left(\hat{\boldsymbol{\beta}}\right)=E\left[{\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{y}\right]=E\left[{\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\mathrm{T}}\left(\boldsymbol{X}\boldsymbol{\beta } +\upepsilon \right)\right]\\ {}=E\left[{\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\boldsymbol{\beta } \right]+{\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\mathrm{T}}E\left(\upepsilon \right)=\boldsymbol{\beta} .\end{array}} $$

and with the additional assumption that the observed responses yi are uncorrelated and have the same variance, Var(yi) = σ2, we can also show that the variance–covariance matrix of this estimator is

$$ \mathrm{Var}\left(\hat{\boldsymbol{\beta}}\right)={\sigma}^2{\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)}^{-1}. $$

When the input features only contain the information of a variable (p = 1), the resulting model is known as simple linear regression and can be easily visualized in the Cartesian plane. When p = 2, the above multiple linear regression describes a plane in the three-dimensional space (x1, x2, y). In general, the conditional expected value of this model defines a hyperplane in the p-dimensional space of the input variables (Montgomery et al. 2012).

The fitted values corresponding to all the training individuals are

$$ \hat{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\beta}}=\boldsymbol{X}{\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{y}=\boldsymbol{Hy}, $$

where the matrix H = X(XTX)−1XT is commonly called the hat matrix. This is because the vector of observed response values is mapped by this expression to the vector of fitted values (Montgomery et al. 2012); in this way, it "puts the hat" on y (Hastie et al. 2009). In a similar way, a predicted value for an arbitrary individual with feature vector x can be obtained by

$$ {\hat{\boldsymbol{y}}}^{\ast }={\boldsymbol{x}}^{\ast \mathrm{T}}\hat{\boldsymbol{\beta}}, $$

where x∗ = (1, xT)T.

An unbiased estimator for the common residual variance σ2 is obtained by

$$ {\displaystyle \begin{array}{c}{\hat{\sigma}}^2=\frac{1}{n-p-1}\sum \limits_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2\\ {}=\frac{1}{n-p-1}\sum \limits_{i=1}^n{e}_i^2\\ {}=\frac{1}{n-p-1}{\boldsymbol{e}}^{\mathrm{T}}\boldsymbol{e}\\ {}=\frac{1}{n-p-1}{\left(\boldsymbol{y}-\hat{\boldsymbol{y}}\right)}^{\mathrm{T}}\left(\boldsymbol{y}-\hat{\boldsymbol{y}}\right)\\ {}=\frac{1}{n-p-1}{\boldsymbol{y}}^{\mathrm{T}}\left({\boldsymbol{I}}_n-\boldsymbol{H}\right)\boldsymbol{y},\end{array}} $$

where \( {e}_i={y}_i-{\hat{y}}_i \) is known as the residual of the model corresponding to the individual i, \( \boldsymbol{e}=\boldsymbol{y}-\hat{\boldsymbol{y}} \) is the vector of all residual values, and In is the identity matrix of order n × n.
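Continuing the illustrative simulation above (same hypothetical objects X, y, n, and p), the hat matrix, the fitted values, the residuals, and the unbiased estimator of σ2 can be computed directly and checked against the residual standard error reported by lm():

H <- X %*% solve(t(X)%*%X) %*% t(X)         # hat matrix H = X(X^T X)^{-1}X^T
y_hat <- H %*% y                            # fitted values
e <- y - y_hat                              # residuals
sigma2_hat <- sum(e^2)/(n - p - 1)          # unbiased estimator of sigma^2
sqrt(sigma2_hat)                            # compare with summary(lm(y ~ X1))$sigma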

The traditional inferential and prediction analysis for this model assumes that the random error ϵ is normally distributed with mean zero and variance σ2. With this, we can show that the OLS estimator of the beta coefficients, \( \hat{\boldsymbol{\beta}} \), is a random vector distributed according to a multivariate normal distribution with mean vector β and the variance–covariance matrix defined previously (Montgomery et al. 2012; Hastie et al. 2009; Rencher and Schaalje 2008). Another important fact, which will be described in more detail in the next section, is that under the Gaussian assumption on the errors, the OLS estimator of β coincides with the maximum likelihood estimator.

We can also show that \( \left(n-p-1\right){\hat{\sigma}}^2/{\sigma}^2 \) is independent of \( \hat{\boldsymbol{\beta}} \) and distributed according to a Chi-squared distribution with n − p − 1 degrees of freedom. Based on this and on the properties of the normal and Student's t distributions, it can be shown that for each j = 0, …, p, \( {T}_j=\left({\hat{\beta}}_j-{\beta}_j\right)/\sqrt{c_{jj}{\hat{\sigma}}^2} \), where cjj is the (j + 1, j + 1) element of the matrix (XTX)−1, is a random variable with a Student's t distribution with n − p − 1 degrees of freedom (tn − p − 1). That is, Tj ∼ tn − p − 1, where ∼ stands for "distributed as". From here, a 100(1 − α)% confidence interval for a particular beta coefficient, βj, is given by

$$ {\hat{\beta}}_j\pm {t}_{1-\alpha /2,\kern0.75em n-p-1}\sqrt{c_{jj}{\hat{\sigma}}^2}, $$

where tα, n − p − 1 is the α quantile of the Student's t distribution with n − p − 1 degrees of freedom. Similarly, a 100(1 − α)% joint confidence region for all the beta coefficients, β, consists of the values of β that satisfy

$$ \frac{{\left(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta} \right)}^{\mathrm{T}}{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\left(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta} \right)}{\left(p+1\right){\hat{\sigma}}^2}\le {F}_{1-\alpha, \kern0.5em n-p-1}^{p+1}, $$

where \( {F}_{\alpha, n-p-1}^{p+1} \) denotes the α quantile of the F distribution with p + 1 and n − p − 1 degrees of freedom in the numerator and denominator, respectively (Rencher and Schaalje 2008).
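As an illustration (again reusing the simulated objects from the sketches above), 95% confidence intervals for the beta coefficients can be computed from the t quantile and the diagonal elements cjj of (XTX)−1 and compared with the intervals reported by confint():

XtX_inv <- solve(t(X)%*%X)
c_jj <- diag(XtX_inv)                       # c_{jj} values
beta_hat <- drop(XtX_inv %*% t(X) %*% y)
t_q <- qt(0.975, df = n - p - 1)            # t quantile for a 95% interval
cbind(lower = beta_hat - t_q*sqrt(c_jj*sigma2_hat),
      upper = beta_hat + t_q*sqrt(c_jj*sigma2_hat))
confint(lm(y ~ X1))                         # reference intervals from lm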

In a similar way, to test a hypothesis about a specific beta coefficient, H0j : βj = βj0, the following rule can be used: reject H0j if \( {T}_{j0}=\left({\hat{\beta}}_j-{\beta}_{j0}\right)/\sqrt{c_{jj}{\hat{\sigma}}^2} \) is "large" in magnitude, that is, if |Tj0| > t1 − α/2, n − p − 1, where α is the desired significance level of the test. More generally, the test of H0 : Wβ = w, where W is a q × (p + 1) matrix of rank q ≤ p + 1, can be performed using the following rule:

$$ \mathrm{reject}\ {H}_0\ \mathrm{if}\ F=\frac{{\left(\boldsymbol{W}\hat{\boldsymbol{\beta}}-\boldsymbol{w}\right)}^{\mathrm{T}}{\left[\boldsymbol{W}{\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)}^{-1}{\boldsymbol{W}}^{\mathrm{T}}\right]}^{-1}\left(\boldsymbol{W}\hat{\boldsymbol{\beta}}-\boldsymbol{w}\right)}{q{\hat{\sigma}}^2}\ge {F}_{1-\alpha, n-p-1}^{q}, $$

where \( {F}_{1-\alpha, n-p-1}^{q} \) is the 1 − α quantile of the F distribution with q and n − p − 1 degrees of freedom in the numerator and denominator, respectively.
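The general linear hypothesis test can also be carried out in a few lines of R. The sketch below (a hypothetical restriction matrix W and vector w, testing H0 : β1 = β2 in the simulated example above) computes the F statistic given above and the corresponding critical value:

W <- matrix(c(0, 1, -1), nrow = 1)          # single restriction: beta_1 - beta_2 = 0
w <- 0
q <- nrow(W)
d <- W %*% beta_hat - w
F_stat <- drop(t(d) %*% solve(W %*% XtX_inv %*% t(W)) %*% d)/(q*sigma2_hat)
F_stat
qf(0.95, df1 = q, df2 = n - p - 1)          # reject H0 if F_stat exceeds this value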

3.3 Fitting the Linear Multiple Regression Model via the Maximum Likelihood (ML) Method

The maximum likelihood (ML) method is a more general and popular approach for estimating the parameters of a model (Casella and Berger 2002). It consists of finding the parameter values that maximize the "probability" of the values observed in the sample under the adopted model. Specifically, if \( \left({\boldsymbol{x}}_{\boldsymbol{i}}^{\mathrm{T}},{y}_i\right) \), i = 1, …, n, is a set of observations from a multiple linear regression model (3.1) with normally distributed, homoscedastic, and uncorrelated errors, the ML estimators of β and σ2, \( \hat{\boldsymbol{\beta}} \) and \( {\hat{\sigma}}^2 \), are defined as

$$ \left({\hat{\boldsymbol{\beta}}}^{\mathrm{T}},{\hat{\sigma}}^2\right)=\underset{\boldsymbol{\beta}, {\sigma}^2}{\arg\ \max }L\left(\boldsymbol{\beta}, {\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right), $$

where L(β, σ2; y, X) is the likelihood function of the parameters, which is the probability of the observed response values but viewed as a function of the parameters

$$ L\left(\boldsymbol{\beta}, {\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right)={\left(\frac{1}{\sqrt{2\pi {\sigma}^2}}\right)}^n\exp \left[-\frac{1}{2{\sigma}^2}{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right)}^{\mathrm{T}}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right)\right]. $$

Then, the log(L(β, σ2; y, X)) is equal to

$$ \log \left(L\left(\boldsymbol{\beta}, {\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right)\ \right)=-\frac{n}{2}\log \left(2\pi \right)-n\log \left(\sigma \right)-\frac{1}{2{\sigma}^2}{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right)}^{\mathrm{T}}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right) $$

To find the maximum over β and σ2, we take the derivatives of \( \log \left(L\left(\boldsymbol{\beta},{\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right)\ \right) \) with respect to these parameters:

$$ \frac{\partial \log \left(L\left(\boldsymbol{\beta}, {\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right)\ \right)}{\partial \boldsymbol{\beta}}=-\frac{\left[\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)\boldsymbol{\beta} -{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{Y}\right]}{\sigma^2} $$
$$ \frac{\partial \log \left(L\left(\boldsymbol{\beta}, {\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right)\ \right)}{\partial {\sigma}^2}=-\frac{n}{2{\sigma}^2}+\frac{1}{2{\sigma}^4}{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right)}^{\mathrm{T}}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right) $$

Now, by setting these derivatives equal to zero and solving the resulting equations for β and σ2, we find that the estimators of these parameters are

$$ \hat{\boldsymbol{\beta}}={\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\right)}^{-1}{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{y} $$
$$ {\hat{\sigma}}^2=\frac{1}{n}{\left(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}}\right)}^{\mathrm{T}}\left(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}}\right). $$

From this we can see that for each value of σ2, the value of β that maximizes the likelihood is the same value that maximizes \( -\frac{1}{2{\sigma}^2}{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right)}^{\mathrm{T}}\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\beta } \right) \), which in turn minimizes (y − Xβ)T(y − Xβ); this is precisely the OLS estimator of β, \( \hat{\boldsymbol{\beta}} \). Then, by equating the derivative of \( \log \left(L\left(\hat{\boldsymbol{\beta}},{\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right)\ \right) \) with respect to σ2 to zero and solving for σ2, the value of σ2 that maximizes \( L\left(\hat{\boldsymbol{\beta}},{\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right) \) is \( {\hat{\sigma}}^2=\frac{1}{n}{\left(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}}\right)}^{\mathrm{T}}\left(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}}\right). \)

Finally,

$$ L\left(\boldsymbol{\beta}, {\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right)\le L\left(\hat{\boldsymbol{\beta}},{\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right)\le L\left(\hat{\boldsymbol{\beta}},{\hat{\sigma}}^2;\boldsymbol{y},\boldsymbol{X}\right) $$

and from here, the MLE of β and σ2 are \( \hat{\boldsymbol{\beta}} \) and \( {\hat{\sigma}}^2 \), because it can be shown that the values of parameters that maximize the likelihood are unique when the design matrix X is of full column rank.
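To verify numerically that the ML solution coincides with the OLS estimator of β and that the ML estimator of σ2 uses the divisor n, the following sketch maximizes the log-likelihood with optim() on the simulated data of the earlier sketches (hypothetical starting values; σ is optimized on the log scale to keep it positive):

negloglik <- function(theta, y, X) {
  beta <- theta[1:ncol(X)]
  sigma <- exp(theta[ncol(X) + 1])          # log-scale parameterization of sigma
  -sum(dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE))
}
init <- c(rep(0, ncol(X)), 0)
fit <- optim(init, negloglik, y = y, X = X, method = "BFGS")
fit$par[1:ncol(X)]                          # ML estimates of beta (match the OLS estimates)
exp(2*fit$par[ncol(X) + 1])                 # ML estimate of sigma^2 (divisor n)
sum((y - X %*% beta_hat)^2)/n               # closed-form ML estimate of sigma^2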

3.4 Fitting the Linear Multiple Regression Model via the Gradient Descent (GD) Method

The steepest descent method, also known as the gradient descent (GD) method, is a first-order iterative algorithm for minimizing a function (f). It is a central mechanism in statistical learning for training models (estimating their parameters), for example, in neural networks and penalized regression models (Ridge and Lasso). It consists of successively updating the argument of the objective function in the direction of steepest descent (along the negative of the gradient of the function), that is, in the direction in which f decreases most rapidly (Haykin 2009; Nocedal and Wright 2006). Specifically, each step of this algorithm is described by

$$ {\eta}_{t+1}={\eta}_t-\alpha \nabla f\left({\eta}_t\right), $$

where ∇f(ηt) is the gradient vector of f evaluated at the current value ηt and α is a step size or learning rate parameter, which largely determines the convergence behavior toward an optimal solution (Haykin 2009; Beysolow II 2017); in neural networks it is popular to set it at a small, fixed value (Warner and Misra 1996; Goodfellow et al. 2016). The learning rate parameter can also be adaptive, that is, it can be allowed to change at each step. For example, the Keras library (see Chap. 11), which can be used for implementing and training neural network models, offers several optimizers based on adaptive gradient descent algorithms such as Adam, Adagrad, Adadelta, and RMSprop, among others (Allaire and Chollet 2019). The ideal value of the step size would be the one that gives the largest reduction in each step, that is, the value of α that minimizes f(ηt − α ∇ f(ηt)), which in general is difficult and expensive to obtain (Nocedal and Wright 2006).

Although the use of this algorithm can be avoided in an MLR, especially with small data sets, and also because of its slow convergence in linear systems (Burden and Faires 2011), here we describe how it works for finding the optimal beta coefficients in this model. First, the gradient of the residual sum of squares is given by

$$ \nabla \mathrm{RSS}\left(\boldsymbol{\beta} \right)=2\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\boldsymbol{\beta } -{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{y}\right). $$

Then, the next update of beta coefficients in the gradient descent algorithm in this model is given by

$$ {\displaystyle \begin{array}{c}{\boldsymbol{\beta}}_{t+1}={\boldsymbol{\beta}}_t-2\alpha \left({\boldsymbol{X}}^{\mathrm{T}}{\boldsymbol{X}\boldsymbol{\beta}}_t-{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{y}\right)\\ {}={\boldsymbol{\beta}}_t-2\alpha {\boldsymbol{X}}^{\mathrm{T}}\left({\boldsymbol{X}\boldsymbol{\beta}}_t-\boldsymbol{y}\right)\\ {}={\boldsymbol{\beta}}_t+2\alpha {\boldsymbol{X}}^{\mathrm{T}}{\boldsymbol{e}}_t,\end{array}} $$

where et = y − Xβt is the vector of residuals obtained in the current iteration. One way to speed up the convergence of the algorithm is to choose the ideal learning rate at each step which, as described before, is the value of α that minimizes f(ηt − α ∇ f(ηt)); for the MLR model this is given by (Nocedal and Wright 2006):

$$ {\alpha}_t=\frac{{\boldsymbol{e}}_t^{\mathrm{T}}\boldsymbol{X}{\boldsymbol{X}}^{\mathrm{T}}{\boldsymbol{e}}_t}{{\boldsymbol{e}}_t^{\mathrm{T}}{\left(\boldsymbol{X}{\boldsymbol{X}}^{\mathrm{T}}\right)}^2{\boldsymbol{e}}_t}. $$

Example 1

For numerical illustration, we considered a synthetic data set that consists of 100 observations and two covariates. The scatter plots in Fig. 3.1 show how the response variable (y) is related to the two covariates (x1, x2). By setting the learning rate parameter to 10−2 and using as the stopping criterion a tolerance of 10−8 for the maximum norm of the difference between the current and next vector of values, the beta coefficients obtained with the GD method are \( \hat{\boldsymbol{\beta}}=\left(5.0460764,0.8551383,2.1903356\right) \). For this synthetic example, 12 iterations were necessary, while changing the learning rate parameter to 10−3 increased the number of iterations to 185, with practically the same results. Now, by using the "optimal" learning rate parameter described before for the MLR, with the same tolerance (10−8), the number of iterations required for convergence is reduced to only 10. In general, the performance of gradient descent depends greatly on the objective function and can be affected by the characteristics of the model, the dispersion of the data (variance explained by the predictors), and the dependence between the predictors, among other factors.

Fig. 3.1 Scatter plot of synthetic data generated from an MLR with two covariates

In the data set used in this example, the covariates are independent and the proportion of variance explained by the predictors is about 79% of the total variance of the response. By changing to a pair of moderately correlated covariates with correlation 0.75, while holding the same beta coefficient values, the same residual variance (1.44), and the same sample size, we generated data where a greater proportion of the variance is explained by the covariates (85.6%). However, when applying gradient descent with the same tolerance (10−8) and learning rates of 10−2 and 10−3, the required numbers of iterations (69 and 649) are about 5.75 and 3.5 times, respectively, those required in the independent covariates example described before.

Continuing with this case of dependent covariates, when using the optimal learning rate described before for the MLR, the number of iterations is reduced to 60, which is 9 fewer than when using the constant learning rate of 10−2.

By multiplying the beta coefficients used before by sqrt(0.1), the proportion of variance explained by the covariates is reduced to 27.30% and 37.2% in the independent covariates (E3) and correlated covariates (E4) scenarios described before, respectively. With a tolerance of 10−8 and a learning rate of 10−2, the required numbers of iterations are 183 and 66 for scenarios E3 and E4, respectively, while for a learning rate of 10−3 they are 617 and 1638. When using the "optimal" learning rate parameter, the required numbers of iterations are reduced to 17 and 56 for scenarios E3 and E4, respectively.

The R code used for implementing the GD method is given next.

#################R code for Example 1 ###############################
rm(list=ls())
library(mvtnorm)
set.seed(1)
X = cbind(1,rmvnorm(100,c(0,0),diag(2)))      # design matrix: intercept and two covariates
#Uncomment the next three lines of code to simulate dependent covariates
#Sigma_X = 0.75+0.25*diag(2)
#L = t(chol(Sigma_X))
#X[,2:3] = X[,2:3]%*%t(L)
betav = c(5,1,2.1)
#Uncomment the next line of code to reduce the value of the beta coefficients and reduce the proportion of variance of the response explained by the features
#betav = sqrt(0.1)*betav
y = X%*%betav + rnorm(100,0,1.2)              # residual standard deviation 1.2 (variance 1.44)
dat = data.frame(y=y,x1 = X[,2],x2=X[,3])
plot(dat)
alpha = 1e-2                                  # learning rate
#alpha = 1e-3
tol = 1e-8                                    # tolerance for the stopping criterion
p = 2
betav_0 = c(mean(y),rep(0,p))                 # initial value of the beta coefficients
tol.e = 1
Iter = 0
tX = t(X)
XXt = X%*%t(X)                                # X X^T, used in the optimal learning rate
while(tol<tol.e)
{
  Iter = Iter + 1
  e = y-X%*%betav_0                           # current residual vector
  #Uncomment the next line of code to use the optimal learning rate
  #alpha = (t(e)%*%XXt%*%e/(t(e)%*%XXt%*%XXt%*%e))[1,1]
  betav_t = betav_0 + alpha*tX%*%e            # gradient descent update
  tol.e = max(abs(betav_t-betav_0))
  betav_0 = betav_t
}
betav_t
tol.e
Iter

This code is provided only for illustrative purposes, that is, to show in a very transparent way how the GD method can be implemented. Of course, existing statistical machine learning software implements this method, so there is no need to use this program for real applications: the existing implementations do this work more efficiently and in a more user-friendly way.

3.5 Advantages and Disadvantages of Standard Linear Regression Models (OLS and MLR)

The MLR is a simple and computationally appealing class of models, but with many predictors (relative to the sample size) or nearly dependent features, it may result in large prediction errors and/or wide prediction intervals (Wakefield 2013). To appreciate the latter case (nearly dependent features), consider the spectral decomposition XTX = ΓΛΓT, where Λ = Diag(λ0, …, λp) is a diagonal matrix with the eigenvalues of XTX in decreasing order and Γ is an orthogonal matrix whose columns are the corresponding eigenvectors of XTX. Then the variance–covariance matrix of the OLS estimator \( \hat{\boldsymbol{\beta}} \) can be expressed as

$$ \mathrm{Var}\left(\hat{\boldsymbol{\beta}}\right)={\sigma}^2{\left(\boldsymbol{\Gamma} \boldsymbol{\Lambda} {\boldsymbol{\Gamma}}^{\mathrm{T}}\right)}^{-1}={\sigma}^2\boldsymbol{\Gamma} {\boldsymbol{\Lambda}}^{-1}{\boldsymbol{\Gamma}}^{\mathrm{T}}. $$

When the features are nearly dependent, some λj's will be "close" to zero and consequently the variances of some \( {{\hat{\beta}}_j}^{\prime}\mathrm{s} \) will be large; this effect becomes more pronounced as the linear dependence of the features gets stronger (Wakefield 2013; Christensen 2011). This strong dependence between features is a problem for the OLS in MLR that is also reflected in the quality of the prediction performance, for example, when this is measured by the conditional expected prediction error (EPE), or mean squared error of prediction, which for an individual with features xo is given by

$$ {\displaystyle \begin{array}{c}\mathrm{EPE}\left({\boldsymbol{x}}_o\right)={E}_{\boldsymbol{Y},{Y}_o\mid \boldsymbol{X},{\boldsymbol{x}}_o}\left[{\left({Y}_o-{\boldsymbol{x}}_o^{\ast \mathrm{T}}\hat{\boldsymbol{\beta}}\right)}^2\right]\\ {}={E}_{\boldsymbol{Y},{Y}_o\mid \boldsymbol{X},{\boldsymbol{x}}_o}\left\{{\left[\left({Y}_o-E\left({Y}_0|{\boldsymbol{x}}_0\right)+E\left({Y}_0|{\boldsymbol{x}}_0\right)-{\boldsymbol{x}}_o^{\ast \mathrm{T}}\hat{\boldsymbol{\beta}}\right)\right]}^2\right\}\\ {}={E}_{\boldsymbol{Y},{Y}_o\mid \boldsymbol{X},{\boldsymbol{x}}_o}\left[{\left({Y}_o-E\left({Y}_0|{\boldsymbol{x}}_0^{\ast \mathrm{T}}\right)\right)}^2\right]+2{E}_{Y_o\mid {\boldsymbol{x}}_o}\left[\left({Y}_o-E\left({Y}_0|{\boldsymbol{x}}_0^{\ast \mathrm{T}}\right)\right)\right]\left[E\left({Y}_0|{\boldsymbol{x}}_0^{\ast \mathrm{T}}\right)-{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}\left({\boldsymbol{x}}_o^{\ast \mathrm{T}}\hat{\boldsymbol{\beta}}\right)\right]\\ {}+{E}_{\boldsymbol{Y},{Y}_o\mid \boldsymbol{X}}\left[{\left(E\left({Y}_0|{\boldsymbol{x}}_0^{\ast \mathrm{T}}\right)-{\boldsymbol{x}}_o^{\ast \mathrm{T}}\hat{\boldsymbol{\beta}}\right)}^2\right]\\ {}={\sigma}^2+{E}_{\boldsymbol{Y},{Y}_o\mid \boldsymbol{X},{\boldsymbol{x}}_o}\left[{\left({\boldsymbol{x}}_o^{\ast \mathrm{T}}\boldsymbol{\beta} -{\boldsymbol{x}}_o^{\ast \mathrm{T}}\hat{\boldsymbol{\beta}}\right)}^2\right]={\sigma}^2+ Var\left({\boldsymbol{x}}_o^{\ast \mathrm{T}}\hat{\boldsymbol{\beta}}|{\boldsymbol{x}}_{\boldsymbol{o}}\right)\\ {}={\sigma}^2+{\sigma}^2{\boldsymbol{x}}_o^{\ast \mathrm{T}}{\boldsymbol{\Gamma} \boldsymbol{\Lambda}}^{-1}{\boldsymbol{\Gamma}}^{\mathrm{T}}{\boldsymbol{x}}_o^{\ast}\\ {}={\sigma}^2\left(1+\sum \limits_{j=0}^p\frac{{\left({x}_{oj}^{\ast \ast}\right)}^2}{\lambda_j}\right),\end{array}} $$

where \( {\boldsymbol{x}}_o^{\ast \ast }={\boldsymbol{\Gamma}}^{\mathrm{T}}{\boldsymbol{x}}_o^{\ast}={\left({x}_{o0}^{\ast \ast },\dots, {x}_{op}^{\ast \ast}\right)}^{\mathrm{T}}. \) This means that the average loss incurred (squared difference between the value to be predicted and the predicted value) by predicting Y0 with its estimated mean under the MLR, \( {\boldsymbol{x}}_o^{\ast \mathrm{T}}\hat{\boldsymbol{\beta}} \), is composed of intrinsic or irreducible data noise (first term) and the variance of \( {\boldsymbol{x}}_o^{\ast \mathrm{T}}\hat{\boldsymbol{\beta}} \) (second term). The former cannot be avoided no matter how well the mean value of Y0 ∣ x0, E(Y0| x0), is estimated, and the latter increases as the dependence of features is stronger. From this, it is apparent that the EPE is also affected by the strong dependence between features, which is a problem of the OLS in an MLR in a prediction context.
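The inflation of Var(β̂) under near-dependence can be illustrated numerically. The sketch below (simulated data with a hypothetical correlation value rho) compares the eigenvalues of XTX and the resulting scaled coefficient variances, diag((XTX)−1), for nearly independent versus strongly correlated features:

library(mvtnorm)
set.seed(2)
var_beta <- function(rho, n = 100) {
  Sigma <- matrix(c(1, rho, rho, 1), 2, 2)   # correlation structure of the two features
  X <- cbind(1, rmvnorm(n, sigma = Sigma))   # design matrix with intercept
  XtX <- t(X) %*% X
  list(eigenvalues = eigen(XtX)$values,      # small eigenvalues signal near-dependence
       scaled_var = diag(solve(XtX)))        # Var(beta_hat_j)/sigma^2
}
var_beta(rho = 0)                            # independent features
var_beta(rho = 0.99)                         # strongly dependent features: inflated variances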

3.6 Regularized Linear Multiple Regression Model

3.6.1 Ridge Regression

Ridge regression, originally proposed as a method to combat multicollinearity, is also a common approach for controlling overfitting in an MLR model (Christensen 2011). It translates the OLS problem into the minimization of the penalized residual sum of squares defined as

$$ \mathrm{PRS}{\mathrm{S}}_{\lambda}\left(\boldsymbol{\beta} \right)=\sum \limits_{i=1}^n{\left({y}_i-{\beta}_0-\sum \limits_{j=1}^p{x}_{ij}{\beta}_j\right)}^2+\lambda \sum \limits_{j=1}^p{\beta}_j^2, $$

where λ ≥ 0 is known as the regularization or tuning parameter, which determines the degree to which the beta coefficients are shrunk toward zero. When λ = 0, the OLS estimate is the solution for the beta coefficients, but when λ is large, PRSSλ(β) is dominated by the penalization term and the solution is shrunk toward 0 (Christensen 2011). In general, when the number of parameters to be estimated is large relative to the number of observations, the estimator can be highly variable. Ridge regression tries to alleviate this by constraining the sum of squares of the beta coefficients.

Note that PRSSλ(β) can be expressed as

$$ \mathrm{PRS}{\mathrm{S}}_{\lambda}\left(\boldsymbol{\beta} \right)=\mathrm{RSS}\left(\boldsymbol{\beta} \right)+\lambda {\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{D}\boldsymbol{\beta }, $$

where D =  diag (0, 1, …, 1) is a diagonal matrix of dimension (p + 1) × (p + 1) equal to the identity except for a zero in its first entry. Then, the gradient of PRSSλ(β), that is, its first derivative with respect to β, is

$$ \nabla \mathrm{PRS}{\mathrm{S}}_{\lambda}\left(\boldsymbol{\beta} \right)=2\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}\boldsymbol{\beta } -{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{y}\right)+2\lambda \boldsymbol{D}\boldsymbol{\beta } . $$

Solving ∇PRSSλ(β) = 0, the Ridge solution is given by

$$ {\hat{\boldsymbol{\beta}}}^R\left(\lambda \right)=\underset{\boldsymbol{\beta}}{\mathrm{argmin}}\mathrm{PRS}{\mathrm{S}}_{\lambda}\left(\boldsymbol{\beta} \right)={\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}+\lambda \boldsymbol{D}\right)}^{-1}{\mathbf{X}}^{\mathrm{T}}\boldsymbol{y}. $$

This is a biased estimator of β because the conditional expected value is given by

$$ E\left[{\hat{\boldsymbol{\beta}}}^R\left(\lambda \right)\right]={\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{X}+\lambda \boldsymbol{D}\right)}^{-1}{\mathbf{X}}^{\mathrm{T}}\boldsymbol{X}\boldsymbol{\beta } $$

but as will be described later, relative to the OLS estimator, by introducing a "small" bias, the variance and/or the EPE of this method can potentially be reduced (Wakefield 2013).
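Given the closed form above, a direct R implementation of the Ridge solution is a short helper function. The sketch below (hypothetical λ values, reusing the simulated X and y from the earlier sketches) uses the penalty matrix D that leaves the intercept unpenalized:

ridge_solution <- function(X, y, lambda) {
  D <- diag(c(0, rep(1, ncol(X) - 1)))      # D = diag(0, 1, ..., 1): intercept not penalized
  solve(t(X) %*% X + lambda*D, t(X) %*% y)  # (X^T X + lambda D)^{-1} X^T y
}
ridge_solution(X, y, lambda = 0)            # equals the OLS solution
ridge_solution(X, y, lambda = 10)           # coefficients shrunk toward zero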

By using the method of Lagrange multipliers, the Ridge regression estimate of the β coefficients can be reformulated in a similar way to the OLS problem, but subject to the condition that the magnitude of β0 = (β1, …, βp)T be less than or equal to \( t{\left(\lambda \right)}^{\frac{1}{2}} \), that is,

$$ {\displaystyle \begin{array}{l}{\hat{\boldsymbol{\beta}}}^R\left(\lambda \right)=\underset{\boldsymbol{\beta}}{\mathrm{argmin}}\mathrm{RSS}\left(\boldsymbol{\beta} \right)\\ {}\mathrm{subject}\ \mathrm{to}\ \sum \limits_{j=1}^p{\beta}_j^2\le t\left(\lambda \right),\end{array}} $$

where t(λ) is a one-to-one function that produces a definition equivalent to the penalized OLS presentation of the Ridge regression described before (Wakefield 2013; Hastie et al. 2009, 2015). This constrained reformulation gives a more transparent view of the role played by the tuning parameter and, among other things, suggests a convenient and common way of redefining the Ridge estimator by standardizing the variables when these are on very different scales.

A graphic representation of this constraint problem for β0 = 0 and p = 2 is given in Fig. 3.2, where the nested ellipsoids correspond to contour plots of RSS(β) and the green region is the restriction with t(λ) = 32, which contains the Ridge solution.

Fig. 3.2 Graphic representation of the Ridge solution of the OLS with restriction \( {\sum}_{j=1}^p{\beta}_j^2<{3}^2 \). The green region contains the Ridge solution for t(λ) = 32

The MLR defined in (3.1) but now defined with the standardized variables is expressed as

$$ {\displaystyle \begin{array}{c}y={1}_n\mu +{\boldsymbol{X}}_{1s}{\boldsymbol{\beta}}_{0s}+\upepsilon \\ {}={\boldsymbol{X}}_s{\boldsymbol{\beta}}_s+\upepsilon, \end{array}} $$

where 1n is the column vector with 1’s in all its entries, \( {\boldsymbol{X}}_{1s}=\left[\begin{array}{ccc}{x}_{11s}& \cdots & {x}_{1 ps}\\ {}\vdots & \vdots & \vdots \\ {}{x}_{n1s}& \cdots & {x}_{nps}\end{array}\right], \) \( {x}_{ij s}=\left({x}_{ij}-{\overline{x}}_j\right)/{s}_j \), \( {s}_j=\sqrt{\sum \limits_{i=1}^n{\left({x}_{ij}-{\overline{x}}_j\right)}^2/n} \), j = 1, …, p; Xs = [1n X1s]; \( {\boldsymbol{\beta}}_s={\left(\mu, \kern0.5em {\boldsymbol{\beta}}_{0s}^{\mathrm{T}}\right)}^{\mathrm{T}}; \) and β0s = (β1s, …, βps)T.

Then, the redefined penalized residual sum squared under this model is

$$ {\displaystyle \begin{array}{c}{\mathrm{PRSS}}_{\lambda}\left({\boldsymbol{\beta}}_s\right)=\sum \limits_{i=1}^n{\left({y}_i-\mu -\sum \limits_{j=1}^p{x}_{ijs}{\beta}_{js}\right)}^2+\lambda \sum \limits_{j=1}^p{\beta}_{js}^2\\ {}={\left(\boldsymbol{y}-{\boldsymbol{X}}_s{\boldsymbol{\beta}}_s\right)}^{\mathrm{T}}\left(\boldsymbol{y}-{\boldsymbol{X}}_s{\boldsymbol{\beta}}_s\right)+\lambda {\boldsymbol{\beta}}_s^{\mathrm{T}}{\boldsymbol{D}\boldsymbol{\beta}}_s.\end{array}} $$

The Ridge solution under this redefinition is like the one given before, but now

$$ {\displaystyle \begin{array}{c}{\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)={\left({\boldsymbol{X}}_s^{\mathrm{T}}{\boldsymbol{X}}_s+\boldsymbol{\lambda} \boldsymbol{D}\right)}^{-1}{\boldsymbol{X}}_s^{\mathrm{T}}\boldsymbol{y}\\ {}={\left(\left[\begin{array}{c}{1}_n^{\mathrm{T}}\\ {}{\boldsymbol{X}}_{1s}^{\mathrm{T}}\end{array}\right]\left[{1}_n\ {\boldsymbol{X}}_{1s}\right]+\lambda \boldsymbol{D}\right)}^{-1}\left[\begin{array}{c}{1}_n^{\mathrm{T}}\\ {}{\boldsymbol{X}}_{1s}^{\mathrm{T}}\end{array}\right]\boldsymbol{y}\\ {}={\left[\begin{array}{cc}n& {0}_n^{\mathrm{T}}\\ {}{0}_n& {\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\end{array}\right]}^{-1}\left[\begin{array}{c}{1}_n^{\mathrm{T}}y\\ {}{\boldsymbol{X}}_{1s}^{\mathrm{T}}y\end{array}\right]\\ {}=\left[\begin{array}{c}{\overline{y}}_n\\ {}{\hat{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)\ \end{array}\right],\end{array}} $$

where \( {\overline{y}}_n={\sum}_{i=1}^n{y}_i/n \) is the sample mean of the responses and \( {\hat{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)={\left({\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{X}}_{1s}^{\mathrm{T}}\boldsymbol{y} \) is the Ridge estimator of β0s. The mean value of this Ridge solution is

$$ \mathrm{E}\left[{\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)\right]=\left[\begin{array}{c}\mu \\ {}E\left[{\hat{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)\right]\ \end{array}\right], $$

where \( E\left[{\hat{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)\right]={\left({\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}{\boldsymbol{\beta}}_{0s} \) is the expected value of the Ridge estimator of β0s. The variance–covariance matrix is

$$ {\displaystyle \begin{array}{c}\mathrm{Var}\left({\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)\right)={\left({\boldsymbol{X}}_s^{\mathrm{T}}{\boldsymbol{X}}_s+\boldsymbol{\lambda} \boldsymbol{D}\right)}^{-1}{\boldsymbol{X}}_s^{\mathrm{T}}\mathrm{Var}\left(\boldsymbol{y}\right){\boldsymbol{X}}_s{\left({\boldsymbol{X}}_s^{\mathrm{T}}{\boldsymbol{X}}_s+\boldsymbol{\lambda} \boldsymbol{D}\right)}^{-1}\\ {}={\sigma}^2{\left[\begin{array}{cc}n& {0}_n^{\mathrm{T}}\\ {}{0}_n& {\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\end{array}\right]}^{-1}\left[\begin{array}{c}{1}_n^{\mathrm{T}}\\ {}{\boldsymbol{X}}_{1s}^{\mathrm{T}}\end{array}\right]\left[{1}_n\kern0.5em {\boldsymbol{X}}_{1s}\right]{\left[\begin{array}{cc}n& {0}_n^{\mathrm{T}}\\ {}{0}_n& {\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\end{array}\right]}^{-1}\\ {}={\sigma}^2\left[\begin{array}{cc}1/n& {0}_n^{\mathrm{T}}\\ {}{0}_n& \mathrm{Var}\left({\hat{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)\right)\end{array}\right],\end{array}} $$

where \( \mathrm{Var}\left({\hat{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)\right)={\left({\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}+\lambda {\boldsymbol{I}}_p\right)}^{-1}{\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}{\left({\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}+\lambda {\boldsymbol{I}}_p\right)}^{-1} \). Because under this standardization the Ridge solution for the intercept (μ) is the sample mean of the observed responses and is uncorrelated with the rest of the estimated parameters (\( {\hat{\boldsymbol{\beta}}}_{0s}\left(\lambda \right) \)), in the literature it is common to handle this parameter separately from all the other coefficients (β0s) (Christensen 2011).
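The sketch below illustrates this separation numerically (reusing the simulated X1 and y from the first sketch and standardizing with the divisor-n standard deviation defined above): the Ridge estimate of the intercept is simply the sample mean of y, and the slope coefficients are obtained from the standardized features alone.

ridge_standardized <- function(X1, y, lambda) {
  n <- nrow(X1); p <- ncol(X1)
  s_n <- apply(X1, 2, sd)*sqrt((n - 1)/n)   # standard deviation with divisor n
  X1s <- scale(X1, center = TRUE, scale = s_n)
  b0s <- solve(t(X1s) %*% X1s + lambda*diag(p), t(X1s) %*% y)
  list(intercept = mean(y), slopes = drop(b0s))
}
ridge_standardized(X1, y, lambda = 5)       # intercept = mean(y); slopes shrunk toward zero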

Note that

$$ E\left[{\hat{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)\right]={\boldsymbol{\Gamma}}_s{\left({\boldsymbol{\Lambda}}_s+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{\Lambda}}_s{\boldsymbol{\beta}}_{0s}^{\ast } $$

and

$$ \mathrm{Var}\left({\hat{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)\right)={\boldsymbol{\Gamma}}_s{\left({\boldsymbol{\Lambda}}_s+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{\Lambda}}_s{\left({\boldsymbol{\Lambda}}_s+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{\Gamma}}_s^{\mathrm{T}}, $$

where \( {\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}={\boldsymbol{\Gamma}}_s{\boldsymbol{\Lambda}}_s{\boldsymbol{\Gamma}}_s^{\mathrm{T}} \) is the spectral decomposition of \( {\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s} \) and \( {\boldsymbol{\beta}}_{0s}^{\ast}={\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{\beta}}_{0s} \). So the conditional expected prediction error at xo when using the Ridge solution is

$$ {\displaystyle \begin{array}{c}{\mathrm{EPE}}_{\lambda}\left({\boldsymbol{x}}_o\right)={E}_{\boldsymbol{Y},{Y}_o\mid \boldsymbol{X},{\boldsymbol{x}}_o}\left[{\left({Y}_o-{\boldsymbol{x}}_o^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)\right)}^2\right]\\ {}={E}_{\boldsymbol{Y},{Y}_o\mid \boldsymbol{X},{\boldsymbol{x}}_o}\left[{\left({Y}_o-E\left({Y}_0|{\boldsymbol{x}}_0\right)+E\left({Y}_0|{\boldsymbol{x}}_0\right)-{\boldsymbol{x}}_o^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)\right)}^2\right]\\ {}={\sigma}^2+{E}_{\boldsymbol{Y},{Y}_o\mid \boldsymbol{X},{\boldsymbol{x}}_o}\left[{\left({\boldsymbol{x}}_o^{\ast \mathrm{T}}{\boldsymbol{\beta}}_s-{\boldsymbol{x}}_o^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)\right)}^2\right]\\ {}={\sigma}^2+\left[{\left({\boldsymbol{x}}_o^{\ast \mathrm{T}}{\boldsymbol{\beta}}_s-{\boldsymbol{x}}_o^{\ast \mathrm{T}}{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}\left({\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)\right)\right)}^2\right]+\mathrm{Var}\left({\boldsymbol{x}}_o^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)|{\boldsymbol{x}}_{\boldsymbol{o}}\right)\\ {}={\sigma}^2+\left[{\left(\mu +{\boldsymbol{x}}_o^{\mathrm{T}}{\boldsymbol{\beta}}_{0s}-\mu -{\boldsymbol{x}}_o^{\mathrm{T}}{\boldsymbol{\Gamma}}_s{\left({\boldsymbol{\Lambda}}_s+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{\Lambda}}_s{\boldsymbol{\beta}}_{0s}^{\ast}\right)}^2\right]+{\sigma}^2\left[\frac{1}{n}+{\boldsymbol{x}}_o^{\mathrm{T}}{\boldsymbol{\Gamma}}_s{\left({\boldsymbol{\Lambda}}_p+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{\Lambda}}_s{\left({\boldsymbol{\Lambda}}_p+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{\Gamma}}_s^T{\boldsymbol{x}}_o\right]\\ {}={\sigma}^2+\left[{\left({\boldsymbol{x}}_o^{\ast \ast \mathrm{T}}{\boldsymbol{\beta}}_{0s}^{\ast}-{\boldsymbol{x}}_o^{\ast \ast \mathrm{T}}{\left({\boldsymbol{\Lambda}}_s+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{\Lambda}}_s{\boldsymbol{\beta}}_{0s}^{\ast}\right)}^2\right]+{\sigma}^2\left[\frac{1}{n}+{\boldsymbol{x}}_o^{\ast \ast \mathrm{T}}{\left({\boldsymbol{\Lambda}}_p+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{\Lambda}}_s{\left({\boldsymbol{\Lambda}}_p+\lambda {\boldsymbol{I}}_{\boldsymbol{p}}\right)}^{-1}{\boldsymbol{x}}_o^{\ast \ast }\ \right]\\ {}={\sigma}^2+{\left[\sum \limits_{j=1}^p\left(1-\frac{\lambda_j}{\lambda_j+\lambda}\right){x}_{oj}^{\ast \ast }{\beta}_{js}^{\ast}\right]}^2+{\sigma}^2\left(\frac{1}{n}+\sum \limits_{j=1}^p\frac{\lambda_j}{{\left({\lambda}_j+\lambda \right)}^2}{\left({x}_{oj}^{\ast \ast}\right)}^2\right),\end{array}} $$

where \( {\boldsymbol{x}}_o^{\ast \ast }={\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{x}}_o^{\ast}={\left({x}_{o0}^{\ast \ast },\dots, {x}_{op}^{\ast \ast}\right)}^{\mathrm{T}} \) and \( {\boldsymbol{\beta}}_{0s}^{\ast}={\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{\beta}}_{0s}={\left({\beta}_{1s}^{\ast},\dots, {\beta}_{ps}^{\ast}\right)}^{\mathrm{T}}. \) The second and third terms of the last equality correspond to the squared bias and the variance of \( {\boldsymbol{x}}_o^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right) \) as an estimator of \( {\boldsymbol{x}}_o^{\ast \mathrm{T}}{\boldsymbol{\beta}}_s \), respectively. By setting λ = 0, this EPE corresponds to the EPE of the OLS prediction but with standardized variables, while by letting λ be very large, the variance will decrease and the squared bias will increase.

More importantly, the derivative of EPEλ(xo) with respect to λ, \( \frac{d}{d\lambda}\mathrm{EPE}_{\lambda}\left({\boldsymbol{x}}_o\right) \), is right continuous at λ = 0, and for X1s of full column rank, \( \underset{\lambda \to {0}^{+}}{\lim}\frac{d}{d\lambda}\mathrm{EPE}_{\lambda}\left({\boldsymbol{x}}_o\right)=-2{\sigma}^2\sum \limits_{j=1}^p\frac{{\left({x}_{oj}^{\ast \ast}\right)}^2}{\lambda_j^2}=c<0 \). Then, for \( \epsilon =-\frac{c}{2}>0 \), there is a λ∗ > 0 such that \( \left|\frac{d}{d\lambda}\mathrm{EPE}_{\lambda}\left({\boldsymbol{x}}_o\right)-c\right|<\epsilon \) for λ < λ∗, and hence \( \frac{d}{d\lambda}\mathrm{EPE}_{\lambda}\left({\boldsymbol{x}}_o\right)<c+\epsilon =\frac{c}{2}<0 \) for all λ < λ∗. Then, at least in the interval [0, λ∗], the expected prediction error at xo shows a decreasing behavior, which indicates that there is a value of λ such that with the Ridge regression estimation of the beta coefficients, we can get a smaller prediction error than with the OLS prediction. Figure 3.3 shows a graphic representation of this behavior of the Ridge prediction, where the lowest EPE is reached at about λ =  exp (2.22). Figure 3.3 also shows the increasing and decreasing behavior of the squared bias and the variance involved.

Fig. 3.3 Behavior of the expected prediction error at xo of the Ridge solution
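A curve like the one in Fig. 3.3 can be computed directly from the last expression for EPEλ(xo). The sketch below (hypothetical eigenvalues λj, rotated coefficients βjs∗, rotated target features xoj∗∗, σ2, and sample size) evaluates the squared bias, the variance, and the resulting EPE over a grid of λ values and locates the λ with the smallest EPE:

sigma2 <- 1.44
lambda_j <- c(60, 5, 0.5)                   # hypothetical eigenvalues of X1s^T X1s
beta_st <- c(1, 0.5, -0.8)                  # hypothetical rotated coefficients beta*_js
x_st <- c(0.8, -1.2, 0.6)                   # hypothetical rotated target features x**_oj
n_obs <- 100
epe_ridge <- function(lambda) {
  shrink <- lambda_j/(lambda_j + lambda)
  bias2 <- (sum((1 - shrink)*x_st*beta_st))^2
  varpt <- sigma2*(1/n_obs + sum(lambda_j/(lambda_j + lambda)^2*x_st^2))
  c(bias2 = bias2, variance = varpt, EPE = sigma2 + bias2 + varpt)
}
lambda_grid <- exp(seq(-2, 6, length.out = 50))
curve_vals <- t(sapply(lambda_grid, epe_ridge))
lambda_grid[which.min(curve_vals[, "EPE"])] # lambda with the smallest EPE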

When X1s is not of full column rank, the previous argument regarding the behavior of the EPE of the Ridge solution is no longer directly valid, but it can be used to validate part of the more general case. To see this, first note that the spectral decomposition of \( {\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s} \) can be reduced to

$$ {\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}=\left[{\boldsymbol{\Gamma}}_{1s}\ {\boldsymbol{\Gamma}}_{2s}\right]\left[\begin{array}{cc}{\boldsymbol{\Lambda}}_{1s}& \mathbf{0}\\ {}{\mathbf{0}}^{\mathrm{T}}& {\boldsymbol{\Lambda}}_{2s}\end{array}\right]\left[\begin{array}{c}{\boldsymbol{\Gamma}}_{1s}^{\mathrm{T}}\\ {}{\boldsymbol{\Gamma}}_{2s}^{\mathrm{T}}\end{array}\right]={\boldsymbol{\Gamma}}_{1s}{\boldsymbol{\Lambda}}_{1s}{\boldsymbol{\Gamma}}_{1s}^{\mathrm{T}}, $$

where \( {\boldsymbol{\Lambda}}_{1s}=\mathrm{Diag}\left({\lambda}_1,\dots, {\lambda}_{p^{\ast }}\right) \), p∗ =  rank (X1s) is the rank of the design matrix, and Λ2s is the null matrix of order (p − p∗) × (p − p∗). Furthermore, because \( {\boldsymbol{\Gamma}}_{2s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}{\boldsymbol{\Gamma}}_{2s}={\boldsymbol{\Lambda}}_{2s}=\mathbf{0} \), which in turn implies that X1sΓ2s = 0, the MLR can be conveniently expressed as

$$ {\displaystyle \begin{array}{c}\boldsymbol{y}={1}_n\mu +{\boldsymbol{X}}_{1s}{\boldsymbol{\beta}}_{0s}+\upepsilon \\ {}={1}_n\mu +{\boldsymbol{X}}_{1s}{\boldsymbol{\Gamma}}_s{\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{\beta}}_{0s}+\upepsilon \\ {}={1}_n\mu +{\boldsymbol{X}}_{1s}\left[{\boldsymbol{\Gamma}}_{1s}\ {\boldsymbol{\Gamma}}_{2s}\right]{\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{\beta}}_{0s}+\upepsilon \\ {}={1}_n\mu +\left[{\boldsymbol{X}}_{1s}{\boldsymbol{\Gamma}}_{1s}\ {\boldsymbol{X}}_{1s}{\boldsymbol{\Gamma}}_{2s}\right]{\boldsymbol{\beta}}_{0s}^{\ast }+\upepsilon \\ {}={1}_n\mu +{\boldsymbol{X}}_{1s}^{\ast}{\boldsymbol{\beta}}_{01s}^{\ast }+\upepsilon, \end{array}} $$

where \( {\boldsymbol{\beta}}_{0s}^{\ast }={\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{\beta}}_{0s} \), \( {\boldsymbol{X}}_{1s}^{\ast}={\boldsymbol{X}}_{1s}{\boldsymbol{\Gamma}}_{1s}, \) and \( {\boldsymbol{\beta}}_{0s}^{\ast }={\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{\beta}}_{0s}={\left[{\boldsymbol{\beta}}_{0s}^T{\boldsymbol{\Gamma}}_{1s,}\ {\boldsymbol{\beta}}_{0s}^{\mathrm{T}}\ {\boldsymbol{\Gamma}}_{2s}\ \right]}^{\mathrm{T}}={\left[{{\boldsymbol{\beta}}_{01s}^{\ast}}^{\mathrm{T}}\ {\boldsymbol{\beta}}_{02s}^{\ast \mathrm{T}}\right]}^{\mathrm{T}}. \)

Also, from similar arguments, note that the penalized residual sum of squares of the Ridge solution can be expressed by

$$ {\displaystyle \begin{array}{l}{\left(\boldsymbol{y}-{1}_n\mu -{\boldsymbol{X}}_{1s}{\boldsymbol{\beta}}_{0s}\right)}^{\mathrm{T}}\left(\boldsymbol{y}-{1}_n\mu -{\boldsymbol{X}}_{1s}{\boldsymbol{\beta}}_{0s}\right)+\lambda {\boldsymbol{\beta}}_{0s}^{\mathrm{T}}{\boldsymbol{\beta}}_{0s}\\ {}={\left(\boldsymbol{y}-{1}_n\mu -{\boldsymbol{X}}_{1s}^{\ast }{\boldsymbol{\beta}}_{01s}^{\ast}\right)}^{\mathrm{T}}\left(\boldsymbol{y}-{1}_n\mu -{\boldsymbol{X}}_{1s}^{\ast }{\boldsymbol{\beta}}_{01s}^{\ast}\right)+\lambda {\boldsymbol{\beta}}_{0s}^{\mathrm{T}}\left({\boldsymbol{\Gamma}}_{1s}^{\mathrm{T}}{\boldsymbol{\Gamma}}_{1s}+{\boldsymbol{\Gamma}}_{2s}^{\mathrm{T}}{\boldsymbol{\Gamma}}_{2s}\right){\boldsymbol{\beta}}_{0s}\\ {}={\left(\boldsymbol{y}-{1}_n\mu -{\boldsymbol{X}}_{1s}^{\ast }{\boldsymbol{\beta}}_{01s}^{\ast}\right)}^{\mathrm{T}}\left(\boldsymbol{y}-{1}_n\mu -{\boldsymbol{X}}_{1s}^{\ast }{\boldsymbol{\beta}}_{01s}^{\ast}\right)+\lambda {\boldsymbol{\beta}}_{01s}^{\ast \mathrm{T}}{\boldsymbol{\beta}}_{01s}^{\ast }+\lambda {\boldsymbol{\beta}}_{02s}^{\ast \mathrm{T}}{\boldsymbol{\beta}}_{02s}^{\ast}\end{array}} $$

This function of \( {\boldsymbol{\beta}}_{0s}^{\ast } \) is minimized at \( {\overset{\sim }{\boldsymbol{\beta}}}_{0s}^{\ast}\left(\lambda \right)={\left({\overset{\sim }{\boldsymbol{\beta}}}_{01s}^{\ast \mathrm{T}}\left(\lambda \right),{\mathbf{0}}_{p-{p}^{\ast}}^{\mathrm{T}}\ \right)}^{\mathrm{T}} \), where \( {\overset{\sim }{\boldsymbol{\beta}}}_{01s}^{\ast \mathrm{T}}\left(\lambda \right)={\left({\boldsymbol{\Lambda}}_{1s}+\lambda {\boldsymbol{I}}_{p^{\ast }}\right)}^{-1}{\boldsymbol{\Gamma}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1S}^{\mathrm{T}}\boldsymbol{y} \) is the Ridge solution of the MLR expressed in terms of \( {\boldsymbol{X}}_{1s}^{\ast}{\boldsymbol{\beta}}_{0s}^{\ast } \). Furthermore, because \( {\boldsymbol{\beta}}_{0s}^{\ast }={\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{\beta}}_{0s} \) is a non-singular transformation, the original Ridge solution of βs can be expressed in terms of \( {\overset{\sim }{\boldsymbol{\beta}}}_{0s}^{\ast}\left(\lambda \right) \) as \( {\hat{\boldsymbol{\beta}}}_s\left(\lambda \right)={\left({\overline{y}}_n,{\hat{\boldsymbol{\beta}}}_{0s}^{\ast \mathrm{T}}\left(\lambda \right)\right)}^{\mathrm{T}} \), where \( {\hat{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)={\boldsymbol{\Gamma}}_s{\hat{\boldsymbol{\beta}}}_{0s}^{\ast \mathrm{T}}\left(\lambda \right)={\boldsymbol{\Gamma}}_{1s}{\overset{\sim }{\boldsymbol{\beta}}}_{01s}^{\ast}\left(\lambda \right). \) Then, in a similar fashion as before, the conditional expected prediction error at xo by using the Ridge solution in this case can be computed as

$$ {\displaystyle \begin{array}{c}{\mathrm{EPE}}_{\lambda}\left({\boldsymbol{x}}_o\right)={E}_{\boldsymbol{Y},{Y}_o\mid \boldsymbol{X},{\boldsymbol{x}}_o}\left[{\left({Y}_o-{\boldsymbol{x}}_o^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)\right)}^2\right]\\ {}={\sigma}^2+{E}_{\boldsymbol{Y},{Y}_o\mid \boldsymbol{X},{\boldsymbol{x}}_o}\left[{\left({\boldsymbol{x}}_o^{\ast \mathrm{T}}{\boldsymbol{\beta}}_s-{\boldsymbol{x}}_o^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)\right)}^2\right]\\ {}={\sigma}^2+\left[{\left({\boldsymbol{x}}_o^{\ast \mathrm{T}}{\boldsymbol{\beta}}_s-{\boldsymbol{x}}_o^{\ast \mathrm{T}}{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}\left({\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)\right)\right)}^2\right]+\mathrm{Var}\left({\boldsymbol{x}}_o^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)|{\boldsymbol{x}}_{\boldsymbol{o}}\right)\\ {}={\sigma}^2+\left[{\left(\mu +{\boldsymbol{x}}_o^{\mathrm{T}}{\boldsymbol{\Gamma}}_s{\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{\beta}}_{0s}-\mu -{\boldsymbol{x}}_o^{\mathrm{T}}{\boldsymbol{\Gamma}}_{1s}{E}_{\boldsymbol{Y}\mid \boldsymbol{X}}\left({\hat{\boldsymbol{\beta}}}_{0s}^{\ast \mathrm{T}}\left(\lambda \right)\right)\right)}^2\right]+\left[\frac{\sigma^2}{n}+\mathrm{Var}\left({\boldsymbol{x}}_o^{\mathrm{T}}{\boldsymbol{\Gamma}}_s{\hat{\boldsymbol{\beta}}}_{0s}^{\ast \mathrm{T}}\left(\lambda \right)|{\boldsymbol{x}}_{\boldsymbol{o}}\right)\right]\\ {}={\sigma}^2+\left[{\left(\mu +{\boldsymbol{x}}_o^{\mathrm{T}}{\boldsymbol{\Gamma}}_s{\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{\beta}}_{0s}-\mu -{\boldsymbol{x}}_o^{\mathrm{T}}{\boldsymbol{\Gamma}}_{1s}{\left({\boldsymbol{\Lambda}}_{1s}+\lambda {\boldsymbol{I}}_{p^{\ast }}\right)}^{-1}{\boldsymbol{\Gamma}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1c}^{\mathrm{T}}{\boldsymbol{X}}_{1c}{\boldsymbol{\beta}}_{0S}\right)}^2\right]\\ {}+{\sigma}^2\left[\frac{1}{n}+{\boldsymbol{x}}_o^{\mathrm{T}}{\boldsymbol{\Gamma}}_{1s}{\left({\boldsymbol{\Lambda}}_{1s}+\lambda {\boldsymbol{I}}_{{\boldsymbol{p}}^{\ast}}\right)}^{-1}{\boldsymbol{\Lambda}}_{1s}{\left({\boldsymbol{\Lambda}}_{1s}+\lambda {\boldsymbol{I}}_{{\boldsymbol{p}}^{\ast}}\right)}^{-1}{\boldsymbol{\Gamma}}_{1s}^T{\boldsymbol{x}}_o\right]\\ {}=\left\{\begin{array}{cc}{\sigma}^2+\left[{\left(\sum \limits_{j=1}^{p^{\ast }}\left(1-\frac{\lambda_j}{\lambda_j+\lambda}\right){x}_{oj}^{\ast }{\beta}_{oj}^{\ast}\right)}^2\right]+{\sigma}^2\left[\frac{1}{n}+\sum \limits_{j=1}^{p^{\ast }}\frac{\lambda_j}{{\left({\lambda}_j+\lambda \right)}^2}{\left({x}_{oj}^{\ast}\right)}^2\right]& \mathrm{if}\ {x}_o={\boldsymbol{\Gamma}}_{1s}{\boldsymbol{a}}_1\\ {}{\sigma}^2+{\left[{\boldsymbol{x}}_o^T{\boldsymbol{\beta}}_{0s}\right]}^2+\frac{\sigma^2}{n}& \mathrm{if}\ {x}_o={\boldsymbol{\Gamma}}_{2s}{\boldsymbol{a}}_2,\end{array}\right.\end{array}} $$

where \( {\boldsymbol{x}}_o^{\ast}={\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{x}}_o={\left[{x}_{o1}^{\ast },\dots, {x}_{op}^{\ast}\right]}^{\mathrm{T}} \), \( {\boldsymbol{\beta}}_o^{\ast}={\boldsymbol{\Gamma}}_s^{\mathrm{T}}{\boldsymbol{\beta}}_{os}={\left[{\beta}_{o1}^{\ast},\dots, {\beta}_{op}^{\ast}\right]}^{\mathrm{T}}, \) \( {\boldsymbol{a}}_1\mathbf{\in}{\mathbb{R}}^{p^{\ast }} \), and \( {\boldsymbol{a}}_2\mathbf{\in}{\mathbb{R}}^{p-{p}^{\ast }} \). So, using a similar argument as before, in the first case (xo = Γ1sa1), there is a value of λ > 0 such that the expected prediction error at xo is smaller than the one obtained with the OLS approach, \( {\hat{\boldsymbol{\beta}}}_s(0)=\underset{\lambda \to 0}{\lim }{\hat{\boldsymbol{\beta}}}_s\left(\lambda \right)={\left[{\overline{y}}_n,{\left({\boldsymbol{\Gamma}}_{1s}{\Lambda}_{1s}^{-1}{\boldsymbol{\Gamma}}_{1s}^{\mathrm{T}}{\boldsymbol{X}}_{1s}^{\mathrm{T}}\boldsymbol{y}\right)}^{\mathrm{T}}\right]}^{\mathrm{T}} \). In the second case, xo = Γ2sa2, the EPE(xo) is the same in both approaches and does not depend on λ, so in such cases no gain is achieved with the Ridge solution. A third case is when the target feature is of the form xo = Γ1sa1 + Γ2sa2; in this case, under the described argument, the advantage of Ridge regression over the OLS approach in a prediction context is not clear.

However, in practice, we don't know the true values of the parameters, and evaluating the test error over all possible training samples is not feasible since we only have one sample. So a common way to choose the value of λ is by cross-validation. For more details about validation strategies, see Chap. 4. For example, with K-fold CV, the complete data set is divided into K balanced disjoint subsets, Sk, k = 1, …, K. One subset is used for validation and the rest are used to fit the model for each value in a chosen grid of values of λ. This procedure is repeated K times, each time taking a different subset of the partition as the validation set. A more detailed K-fold CV procedure is described below:

  1. First, choose a grid of values of λ, λ = (λ1, …, λL).

  2. Remove the subset Sk, and for each value λl in the grid, fit the model with the remaining K − 1 subsets of the partition. Denote by \( {\hat{\boldsymbol{\beta}}}_{-k}^R\left({\lambda}_l\right) \) the corresponding Ridge estimate of β, and compute the average prediction error across all observations in the validation set Sk as

    $$ {\hat{\mathrm{APE}}}_{-k}\left({\lambda}_l\right)=\frac{1}{\left|{S}_k\right|}\sum \limits_{y_i\in {S}_k}{\left({y}_i-{\boldsymbol{x}}_i^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}_{-k}^R\left({\lambda}_l\right)\right)}^2, $$

    where ∣Sk∣ denotes the number of observations in the validation subset Sk.

  3. Choose as the best value of λ in the grid, \( {\overset{\sim }{\lambda}}^{\ast} \), the one with the lowest average prediction error across all partitions, that is,

    $$ {\overset{\sim }{\lambda}}^{\ast }=\arg \underset{\lambda_l}{\min}\hat{\mathrm{APE}}\left({\lambda}_l\right), $$

    where \( \hat{\mathrm{APE}}\left({\lambda}_l\right)=\frac{1}{K}\sum \limits_{k=1}^K{\hat{\mathrm{APE}}}_{-k}\left({\lambda}_l\right) \).

  4. Once \( {\overset{\sim }{\lambda}}^{\ast } \) is chosen, we fit the model with the complete data set, and the prediction for a new individual with feature vector xo can be made with \( {\hat{y}}_o={\boldsymbol{x}}_o^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}^R\left({\overset{\sim }{\lambda}}^{\ast}\right) \), where \( {\hat{\boldsymbol{\beta}}}^R\left({\overset{\sim }{\lambda}}^{\ast}\right) \) is the Ridge estimate of β at \( \lambda ={\overset{\sim }{\lambda}}^{\ast } \). A minimal R sketch of this procedure is given after this list.
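The following sketch makes steps 1–4 concrete in plain R (hypothetical grid of λ values, K = 5, reusing the simulated X and y and the ridge_solution() helper from the earlier sketches); packages such as glmnet provide much faster implementations of Ridge regression with built-in cross-validation.

cv_ridge <- function(X, y, lambdas, K = 5) {
  n <- nrow(X)
  fold <- sample(rep(1:K, length.out = n))            # random balanced partition S_1, ..., S_K
  ape <- matrix(NA, K, length(lambdas))
  for (k in 1:K) {
    tr <- fold != k
    for (l in seq_along(lambdas)) {
      b <- ridge_solution(X[tr, ], y[tr], lambdas[l]) # fit with the K - 1 remaining subsets
      ape[k, l] <- mean((y[!tr] - X[!tr, ] %*% b)^2)  # average prediction error on S_k
    }
  }
  lambdas[which.min(colMeans(ape))]                   # lambda with the lowest average APE
}
set.seed(7)
lambda_best <- cv_ridge(X, y, lambdas = exp(seq(-4, 6, length.out = 30)))
beta_ridge <- ridge_solution(X, y, lambda_best)       # refit with the complete data set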

It is important to point out that very often the performance of the model needs to be evaluated for comparison with other competing models. A common way to do this is to split the data set several times into two subsets, one for training the model (Dtr) and the other for testing (Dtst) its predictive ability. In each split, only the training set (Dtr) is used to train the model (following steps 1–3 above, with the training set playing the role of the complete data set), and the prediction evaluation of the fitted model is made with the testing set, as explained in step 4. The prediction evaluation on the testing set is done with an empirical "estimate" of the EPE, \( \mathrm{MSE}=\frac{1}{\left|{D}_{\mathrm{tst}}\right|}{\sum}_{i\in {D}_{\mathrm{tst}}}{\left({y}_i-{\boldsymbol{x}}_i^{\ast \mathrm{T}}{\hat{\boldsymbol{\beta}}}^R\left({\overset{\sim }{\lambda}}^{\ast}\right)\right)}^2 \), and finally, an average evaluation of the performance of the model is obtained across all chosen splits. See Chap. 4 for more explicit details.

The Ridge solution can also be obtained from a Bayesian formulation. To do this, consider the MLR model described before with standardized features and the vector of residuals distributed as Nn(0, σ2In). With this assumption, the vector of responses y follows a multivariate normal distribution with mean vector 1nμ + X1sβ0s and variance–covariance matrix σ2In. Then, to complete the Bayesian formulation, assume \( {\boldsymbol{\beta}}_{0s}\sim {N}_p\left(\mathbf{0},{\sigma}_{\beta}^2{\boldsymbol{I}}_p\right) \) as the prior distribution of the beta coefficients in β0s and a "flat" prior for the intercept μ, where σ2 and \( {\sigma}_{\beta}^2 \) are assumed known. Under this Bayesian specification, the posterior distribution of βs is

$$ {\displaystyle \begin{array}{c}f\left({\boldsymbol{\beta}}_s|\boldsymbol{y},{\boldsymbol{X}}_{1s}\right)\propto L\left(\boldsymbol{\beta}, {\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right)f\left(\mu \right)f\left({\boldsymbol{\beta}}_{0s}\right)\\ {}\propto \exp \left[-\frac{1}{2{\sigma}^2}{\left(\boldsymbol{y}-{\boldsymbol{X}}_s{\boldsymbol{\beta}}_s\right)}^{\mathrm{T}}\left(\boldsymbol{y}-{\boldsymbol{X}}_s{\boldsymbol{\beta}}_s\right)\right]\exp \left(-\frac{1}{2{\sigma}_{\beta}^2}{\boldsymbol{\beta}}_{0s}^{\mathrm{T}}{\boldsymbol{\beta}}_{0s}\right)\\ {}\propto \exp \left\{-\frac{1}{2}\left[{\boldsymbol{\beta}}_s^{\mathrm{T}}\left({\sigma}_{\beta}^{-2}\boldsymbol{D}+{\sigma}^{-2}{\boldsymbol{X}}_s^{\mathrm{T}}{\boldsymbol{X}}_s\right){\boldsymbol{\beta}}_s-2{\sigma}^{-2}{\boldsymbol{y}}^{\mathrm{T}}{\boldsymbol{X}}_s{\boldsymbol{\beta}}_s\right]\right\}\\ {}\propto \exp \left\{-\frac{1}{2}{\left({\boldsymbol{\beta}}_s-{\tilde{\boldsymbol{\beta}}}_s\right)}^{\mathrm{T}}{\tilde{\boldsymbol{\Sigma}}}_{\beta}^{-1}\left({\boldsymbol{\beta}}_s-{\tilde{\boldsymbol{\beta}}}_s\right)\right\},\end{array}} $$

where \( {\overset{\sim }{\boldsymbol{\Sigma}}}_{\beta }={\left({\sigma}_{\beta}^{-2}\boldsymbol{D}+{\sigma}^{-2}{\boldsymbol{X}}_s^{\mathrm{T}}{\boldsymbol{X}}_s\right)}^{-1}={\sigma}^2{\left({\sigma}^2/{\sigma}_{\beta}^2\boldsymbol{D}+{\boldsymbol{X}}_s^{\mathrm{T}}{\boldsymbol{X}}_s\right)}^{-1} \), \( {\overset{\sim }{\boldsymbol{\beta}}}_s={\sigma}^{-2}{\overset{\sim }{\boldsymbol{\Sigma}}}_{\beta }{\boldsymbol{X}}_s^{\mathrm{T}}\boldsymbol{y}, \) and D is the diagonal penalty matrix. That is, the posterior distribution of βs is a multivariate normal distribution with vector mean \( {\overset{\sim }{\boldsymbol{\beta}}}_s={\sigma}^{-2}{\overset{\sim }{\boldsymbol{\Sigma}}}_{\beta }{\boldsymbol{X}}_s^{\mathrm{T}}\boldsymbol{y}={\left({\sigma}^2/{\sigma}_{\beta}^2\boldsymbol{D}+{\boldsymbol{X}}_s^{\mathrm{T}}{\boldsymbol{X}}_s\right)}^{-1}{\boldsymbol{X}}_s^{\mathrm{T}}\boldsymbol{y} \) and variance–covariance matrix \( {\overset{\sim }{\boldsymbol{\Sigma}}}_{\beta }={\sigma}^2{\left({\sigma}^2/{\sigma}_{\beta}^2\boldsymbol{D}+{\boldsymbol{X}}_s^{\mathrm{T}}{\boldsymbol{X}}_s\right)}^{-1}. \) Then, by taking \( \lambda ={\sigma}^2/{\sigma}_{\beta}^2 \), we have that the mean/mode of the posterior distribution of βs coincides with the Ridge estimation described before, \( {\overset{\sim }{\boldsymbol{\beta}}}^R\left(\lambda \right) \).
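This equivalence can be checked numerically. The following sketch, with simulated data and arbitrary choices of σ2 and \( {\sigma}_{\beta}^2 \) (all object names are hypothetical, and the standardization of the features uses R's scale function rather than necessarily the exact convention of this chapter), compares the posterior mean with the Ridge solution obtained by direct numerical minimization of the penalized RSS:

set.seed(2)
n <- 50; p <- 5
Xs <- scale(matrix(rnorm(n * p), n, p))      # standardized features
y  <- 2 + Xs %*% rnorm(p) + rnorm(n)

sigma2      <- 1                             # assumed residual variance
sigma2_beta <- 0.5                           # assumed prior variance of the coefficients
lambda      <- sigma2 / sigma2_beta          # lambda = sigma^2 / sigma_beta^2

X1 <- cbind(1, Xs)                           # intercept plus standardized features
D  <- diag(c(0, rep(1, p)))                  # the intercept is not penalized

# Posterior mean of (mu, beta_0s) under the normal prior
beta_bayes <- solve(lambda * D + t(X1) %*% X1, t(X1) %*% y)

# Ridge solution from direct numerical minimization of the penalized RSS
prss <- function(b) sum((y - X1 %*% b)^2) + lambda * sum(b[-1]^2)
beta_ridge <- optim(rep(0, p + 1), prss, method = "BFGS")$par

round(cbind(beta_bayes, beta_ridge), 4)      # the two solutions coincide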

Example 2

We considered a genomic example to illustrate the Ridge regression approach and the CV process used to choose the tuning parameter λ (the WheatMadaToy data set, with plant height, PH, as the response). This data set consists of 50 observations corresponding to 50 lines, together with a genomic relationship matrix computed from marker information. Table 3.1 shows the prediction performance of the Ridge and OLS approaches in terms of the MSEP across five different splittings, obtained by partitioning the complete data set into five subsets: the data of one subset are used as the testing set and the rest to train the model. For training, an inner five-fold cross-validation (5FCV) was used following steps 1–3 described before, and the prediction performance was evaluated following step 4.

Table 3.1 Prediction behavior of the Ridge and OLS regression models across different partitions of the complete data set: one subset of the partition (20%) is used for evaluating the performance of the model and the rest (80%) for training the model. RR denotes Ridge regression method

Table 3.1 indicates that in four out of five partitions, the Ridge regression shows a smaller MSE than the corresponding OLS approach. In these cases, the MSE of the OLS was, on average, 31.46% greater than that of the Ridge regression approach, and overall the advantage was, on average, 31.14% (MSE = 421.8834 for Ridge and MSE = 655.8596 for OLS). From this, we conclude that the Ridge regression approach shows a better prediction performance than the OLS. The large variation of the MSE between folds observed in this example suggests that a larger number of partitions would be needed for a more precise comparison between models. In applications with larger data sets, however, the use of many partitions is often avoided.

The R code used for obtaining this result is the following:

######################R code for Example 2 ##########################
rm(list=ls())
library(BMTME)
data("WheatMadaToy")
dat_F = phenoMada
dim(dat_F)
dat_F$GID = as.character(dat_F$GID)
G = genoMada
eig_G = eigen(G)
G_0.5 = eig_G$vectors%*%diag(sqrt(eig_G$values))%*%t(eig_G$vectors)
X = G_0.5
y = dat_F$PH
n = length(y)
source('TR_RR.R')
#5FCV
set.seed(3)
K = 5
Tab = data.frame()
Grpv = findInterval(cut(sample(1:n,n),breaks=K),1:n)
for(i in 1:K)
{
  Pos_tr = which(Grpv!=i)
  y_tr = y[Pos_tr]
  X_tr = X[Pos_tr,]
  TR_RR = Tr_RR_f(y_tr,X_tr,K=5,KG=100,KR=1)
  lambv = TR_RR$lambv
  #Tst
  y_tst = y[-Pos_tr]; X_tst = X[-Pos_tr,]
  #RR
  Pred_RR = Pred_RR_f(y_tst,X_tst,TR_RR)
  #OLS
  Pred_ols = Pred_ols_f(y_tst,X_tst,y_tr,X_tr)
  Tab = rbind(Tab,data.frame(Sim=i,MSEP_RR = Pred_RR$MSEP,
                             MSEP_ols = Pred_ols$MSEP))
  cat('i = ', i,'\n')
}
Tab

Tr_RR_f, Pred_RR_f, and Pred_ols_f are R functions loaded by the command source('TR_RR.R'), where TR_RR.R is the R script file given in Appendix 1.

From the last code, three things are important to point out:

  1. Grpv contains the fold assignments for the outer K=5 CV: each time, the model is trained with K − 1 folds and tested with the remaining fold.

  2. The function that trains the model under Ridge regression is Tr_RR_f, while the function that obtains the predictions for the testing set from this trained model is Pred_RR_f; both functions are fully described in Appendix 1.

  3. The predictions under the OLS method are obtained with the function Pred_ols_f, which is also fully detailed in Appendix 1. It is important to point out that the function Tr_RR_f internally implements an inner k-fold CV to tune the hyperparameter λ required in Ridge regression.

Example 3 (Simulation)

To get a better idea of the behavior of the Ridge solution, here we report the results of a small simulation study in a scenario where the number of observations (n = 100) is smaller than the number of features (p = 500) and the features are moderately correlated. Specifically, we generated 100 data sets, each of size 100, from the following model:

$$ {y}_i=5+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0+{\epsilon}_i, $$

where the vector of beta coefficients (β0) was set to the values shown in Fig. 3.4, and the features of all the individuals in each data set were generated from a multivariate normal distribution centered at the null vector and with variance–covariance matrix Σ = 0.25Ip + 0.75Jp, where Ip and Jp are the identity matrix and the matrix of ones of dimension p × p, respectively. The random errors (ϵi) were simulated from a normal distribution with mean 0 and standard deviation 0.5.

Fig. 3.4

Beta coefficients used in simulation: βj, j = 1, …, p

The behavior of the Ridge and OLS solutions across the 100 simulated data sets is shown in Fig. 3.5. The MSE of Ridge regression is located on the x-axis and the corresponding MSE of the OLS is located on the y-axis. On average, the OLS resulted in an MSE equal to 808.81, which is 30.59% larger than the average MSE (619.32) of the Ridge approach. In terms of the percentage of simulations in favor of each method, Ridge regression was better in 78 out of 100 simulations, while the OLS was better only in 22 out of 100 simulations. In general, from this small simulation study we obtained more evidence in favor of the Ridge regression method.

Fig. 3.5

MSE of Ridge regression (MSE RR) versus MSE OLS regression (MSE OLS)

The R code used for obtaining this result is the following:

##########################R code for Example 3######################
rm(list=ls(all=TRUE))
library(mvtnorm)
library(MASS)
source('TR_RR.R')
set.seed(10)
n = 100
p = 500
Var = 0.25*diag(p)+0.75
Tab = data.frame()
betav = rnorm(p,rpois(p,1),2)
plot(betav,xlab=expression(j),ylab=expression(beta[j]))
for(i in 1:100)
{
  X = rmvnorm(n,rep(0,p),Var)
  dim(X)
  y = 5+X%*%betav + rnorm(n,0,0.5)
  Pos_tr = sample(1:n,n*0.80)
  y_tr = y[Pos_tr]; X_tr = X[Pos_tr,]
  y_tst = y[-Pos_tr]; X_tst = X[-Pos_tr,]
  #Training RR
  TR_RR = Tr_RR_f(y_tr,X_tr,K=5,KG=100,KR=1)
  TR_RR$lamb_o
  lambv = TR_RR$lambv
  plot(log(TR_RR$lambv),TR_RR$iPEv_mean)
  #Prediction RR in testing data
  Pred_RR = Pred_RR_f(y_tst,X_tst,TR_RR)
  Pred_RR$MSEP
  #OLS
  Pred_ols = Pred_ols_f(y_tst,X_tst,y_tr,X_tr)
  Pred_ols
  Tab = rbind(Tab,data.frame(Sim=i,MSEP_RR = Pred_RR$MSEP,
                             MSEP_ols = Pred_ols$MSEP))
  cat('i = ', i,'\n')
}
Mean_v = colMeans(Tab)
(Mean_v[3]-Mean_v[2])/Mean_v[2]*100
mean(Tab$MSEP_RR<Tab$MSEP_ols)
Pos = which(Tab$MSEP_RR<Tab$MSEP_ols)
mean((Tab$MSEP_ols[Pos]-Tab$MSEP_RR[Pos])/Tab$MSEP_RR[Pos])*100
mean((Tab$MSEP_RR[-Pos]-Tab$MSEP_ols[-Pos])/Tab$MSEP_ols[-Pos])*100
plot(Tab$MSEP_RR, Tab$MSEP_ols,
     col=ifelse(Tab$MSEP_RR<Tab$MSEP_ols,3,2),
     xlab='MSEP RR', ylab='MSEP OLS')
abline(a=0,b=1)

The TR_RR.R script file is the same as the one defined in Example 2 in Appendix 1.

3.6.2 Lasso Regression

Like Ridge regression, Lasso regression solves the OLS problem but penalizes the residual sum of squares in a slightly different way. With the standardized variables, the Lasso estimator of βs is defined as

$$ {\overset{\sim }{\boldsymbol{\beta}}}_s^L\left(\lambda \right)=\arg \underset{\mu, {\boldsymbol{\beta}}_{0s}}{\min}\mathrm{PRS}{\mathrm{S}}_{\lambda}\left({\boldsymbol{\beta}}_s\right),\kern0.75em $$

where now \( \mathrm{PRS}{\mathrm{S}}_{\lambda}\left({\boldsymbol{\beta}}_s\right)=\sum \limits_{i=1}^n{\left({y}_i-\mu -\sum \limits_{j=1}^p{x}_{ijs}{\beta}_{js}\right)}^2+\lambda {\sum}_{j=1}^p\left|{\beta}_{js}\right| \) is the RSS(β) penalized by the sum of the absolute values of the regression coefficients. For λ = 0, the solution is the OLS, while as λ grows, the estimated coefficients are shrunken toward 0 (Tibshirani 1996).

Note that for any given value of β0s, the value of μ that minimizes PRSSλ(βs) is the sample mean of the responses, \( \overset{\sim }{\mu }=\frac{1}{n}{\sum}_{i=1}^n{y}_i \), the same as for the Ridge estimator. However, the rest of the Lasso estimator of βs, that is β0s, cannot be obtained in closed form, so numerical methods are required.

Although there are efficient algorithms for computing the entire regularization path of the Lasso regression coefficients (Efron et al. 2004; Friedman et al. 2008), here we describe the coordinate-wise descent method given in Friedman et al. (2007). The idea of this method is to successively optimize PRSSλ(βs) one parameter (beta coefficient) at a time. Holding βjs, j ≠ k, fixed at their current values \( {\overset{\sim }{\beta}}_{js}\left(\lambda \right), \) the value of βks that minimizes PRSSλ(βs) is given by

$$ {\displaystyle \begin{array}{c}{\tilde{\beta}}_{ks}^{\ast }\ \left(\lambda \right)=S\left(\sum \limits_{i=1}^n{x}_{iks}\left({y}_i-{\tilde{y}}_i^{(k)}\right),\lambda \right)\\ {}=S\left(n{\tilde{\beta}}_{ks}\left(\lambda \right)+\sum \limits_{i=1}^n{x}_{iks}\left({y}_i-{\tilde{y}}_i\right),\lambda \right),\end{array}} $$

where \( \tilde{y}_{i}^{(k)}=\overline{y}+{\sum}_{\underset{j\ne k}{j=1}}^p{x}_{ijs}{\overset{\sim }{\beta}}_{js}\left(\lambda \right) \) and \( S\left(\beta, \lambda \right)=\left\{\begin{array}{cc}\beta -\lambda & \mathrm{if}\ \beta >0\ \mathrm{and}\ \lambda <\left|\beta \right|\\ {}\beta +\lambda & \mathrm{if}\ \beta <0\ \mathrm{and}\ \lambda <\left|\beta \right|\\ {}0& \mathrm{if}\ \lambda \ge \left|\beta \right|\end{array}\right. \). To obtain the Lasso estimate of β0s, this process is repeated across all the coefficients until a convergence threshold criterion is reached.
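The soft-thresholding operator and one version of this coordinate-wise updating scheme can be sketched in a few lines of R code. The sketch below follows a common parameterization in which each update is divided by the sum of squares of the corresponding column, so the scaling of λ may differ by a constant factor from the formulas above; the function names are hypothetical:

# Soft-thresholding operator S(beta, lambda)
soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

# Naive coordinate-wise descent for (1/2)*RSS + lambda*sum(|beta|),
# assuming a centered response and standardized columns of X
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  p    <- ncol(X)
  beta <- rep(0, p)
  r    <- y - mean(y)                      # partial residuals (intercept = mean of y)
  for (it in 1:n_iter) {
    for (k in 1:p) {
      r       <- r + X[, k] * beta[k]      # remove the current contribution of beta_k
      beta[k] <- soft(sum(X[, k] * r), lambda) / sum(X[, k]^2)
      r       <- r - X[, k] * beta[k]      # put the updated contribution back
    }
  }
  beta
}

In practice, the cycling stops when the maximum change in the coefficients falls below a small tolerance rather than after a fixed number of passes.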

This algorithm is implemented in the glmnet R package (Friedman et al. 2010) as part of a more general penalized regression (the elastic net), which is defined as a combination of the Ridge and Lasso penalties. Due to the structure of the algorithm, it can be used on very large data sets and can exploit sparsity in the explanatory variables (Friedman et al. 2008).

Equivalently, the Lasso estimator of beta coefficients β0s can be defined as

$$ {\displaystyle \begin{array}{l}{\tilde{\boldsymbol{\beta}}}_{0s}^L\left(\lambda \right)=\underset{{\boldsymbol{\beta}}_{0s}}{\mathrm{argmin}}\sum \limits_{i=1}^n{\left({y}_i-\overline{y}-\sum \limits_{j=1}^p{x}_{ijs}{\beta}_{js}\right)}^2\\ {}\mathrm{subject}\ \mathrm{to}\ \sum \limits_{j=1}^p\left|{\beta}_{js}\right|\le t\end{array}} $$

With this, the graphic representation of the Lasso estimator is like that of the Ridge (see Fig. 3.6). The nested ellipsoids correspond to contours of RSS(β), and the green region is the restriction with \( t={3}^2 \), which contains the Lasso solution. Indeed, the Lasso solution is the first point at which the contours touch the constraint region, and this will sometimes be at a corner, which makes some coefficients exactly zero. Because the Ridge constraint region has no corners, this will rarely happen in Ridge regression (Tibshirani 1996).

Fig. 3.6

Graphic representation of the Lasso solution of the OLS with restriction \( {\sum}_{j=1}^p\left|{\beta}_j\right|<{3}^2 \). The green region contains the Lasso solution

The Lasso estimator can also be derived from a Bayesian perspective. Suppose that the vector of residuals is distributed as Nn(0, σ2In), as in the Ridge regression case, that the priors of the components of β0s are independent and identically distributed according to a Laplace distribution with mean 0 and variance \( {\sigma}_{\beta}^2 \), and that a “flat” prior is adopted for μ, with σ2 and \( {\sigma}_{\beta}^2 \) known. Then the posterior distribution of βs is

$$ {\displaystyle \begin{array}{c}f\left({\boldsymbol{\beta}}_s|\boldsymbol{y},{\boldsymbol{X}}_{1s}\right)\propto L\left(\boldsymbol{\beta}, {\sigma}^2;\boldsymbol{y},\boldsymbol{X}\right)f\left(\mu \right)f\left({\boldsymbol{\beta}}_{0s}\right)\\ {}\propto \exp \left[-\frac{1}{2{\sigma}^2}{\left(\boldsymbol{y}-{\boldsymbol{X}}_s{\boldsymbol{\beta}}_s\right)}^{\mathrm{T}}\left(\boldsymbol{y}-{\boldsymbol{X}}_s{\boldsymbol{\beta}}_s\right)\right]\prod \limits_{j=1}^p\exp \left(-\frac{\sqrt{2}}{\sigma_{\beta}^2}\left|{\beta}_{js}\right|\right)\\ {}\propto \exp \left\{-\frac{1}{2{\sigma}^2}\left[\sum \limits_{i=1}^n{\left({y}_i-\mu -\sum \limits_{j=1}^p{x}_{ijs}{\beta}_{js}\right)}^2+\lambda \sum \limits_{j=1}^p\left|{\beta}_{js}\right|\right]\right\},\end{array}} $$

where \( \lambda =\sqrt{8}{\sigma}^2/{\sigma}_{\beta}^2 \). Then, the mode of the posterior distribution of βs corresponds to the Lasso estimator described before, \( {\overset{\sim }{\boldsymbol{\beta}}}^L\left(\lambda \right) \).

The performance of Lasso regression in terms of prediction error is sometimes comparable to that of Ridge regression (Hastie et al. 2009). However, as pointed out before, because of the nature of the restriction term, for any given value of t only a subset of the coefficients βjs is nonzero, so the Lasso gives a sparse solution (Efron et al. 2004).

Example 4

To illustrate Lasso regression, we considered the data used in Example 2, but instead of an outer five-fold cross-validation (5FCV), we built 100 random splittings of the complete data set: 80% for training and 20% for testing. Figure 3.7 plots the MSE of the Lasso regression (y-axis) against the MSE of the Ridge regression (x-axis). In 81 out of 100 random splittings, the Ridge regression approach gave a better performance; in these cases, on average, the Lasso regression showed an MSE 92.13% greater than that of the Ridge solution. In the remaining splittings, the Ridge was worse, on average, by 30.91%.

Fig. 3.7

MSE of Ridge regression versus MSE of Lasso regression in 100 random splittings of data: 20% for testing and 80% for training

On average across all the splittings, the performance of the Ridge regression (MSE mean = 118.9726, standard deviation = 50.7193) was superior to that of the Lasso (MSE mean = 200.6021, standard deviation = 222.5494) by 68.61%; the Lasso, in turn, was better than the OLS solution (MSE mean = 1609.4635, standard deviation = 1105.4434) by 802.32%, while the Ridge was 1352.80% better than the OLS estimate.

########################R code for Example 4########################
rm(list=ls())
library(BMTME)
data("WheatMadaToy")
dat_F = phenoMada
dim(dat_F)
dat_F$GID = as.character(dat_F$GID)
G = genoMada
eig_G = eigen(G)
G_0.5 = eig_G$vectors%*%diag(sqrt(eig_G$values))%*%t(eig_G$vectors)
X = G_0.5
y = dat_F$PH
n = length(y)
library(glmnet)
#100 random partitions (80% training, 20% testing)
set.seed(3)
K = 5
Tab = data.frame()
set.seed(1)
for(k in 1:100)
{
  Pos_tr = sample(1:n,n*0.8)
  y_tr = y[Pos_tr]; X_tr = X[Pos_tr,]; n_tr = dim(X_tr)[1]
  y_tst = y[-Pos_tr]; X_tst = X[-Pos_tr,]
  #Partition for internal training the model
  Grpv_k = findInterval(cut(sample(1:n_tr,n_tr),breaks=5),1:n_tr)
  #RR
  A_RR = cv.glmnet(X_tr,y_tr,alpha=0,foldid=Grpv_k,type.measure='mse')
  yp_RR = predict(A_RR,newx=X_tst,s='lambda.min')
  #LR
  A_LR = cv.glmnet(X_tr,y_tr,alpha=1,foldid=Grpv_k,type.measure='mse')
  yp_LR = predict(A_LR,newx=X_tst,s='lambda.min')
  #OLS
  A_OLS = glmnet(X_tr,y_tr,alpha=1,lambda=0)
  yp_OLS = predict(A_OLS,newx=X_tst)
  Tab = rbind(Tab,data.frame(PT=k,MSEP_RR = mean((y_tst-yp_RR)^2),
                             MSEP_LR = mean((y_tst-yp_LR)^2),
                             MSEP_OLS = mean((y_tst-yp_OLS)^2)))
  cat('k = ', k,'\n')
}
Tab

The key components of the R code just given are:

  1. One hundred random partitions were implemented, each obtained with Pos_tr = sample(1:n,n*0.8), which means that 80% of the data is used for training and 20% for testing; for each training set, an inner K=5 fold CV is performed to tune the λ hyperparameter.

  2. Grpv_k contains the fold assignments for the inner K=5 fold CV implemented to tune the hyperparameter λ.

  3. We use the cv.glmnet function, which is useful for implementing supervised learning methods with cross-validation. This function belongs to the R package glmnet, and the input we give to it is the training set (X_tr, y_tr); alpha=0 tells glmnet to fit a Ridge regression, while alpha=1 fits a Lasso regression. With foldid=Grpv_k we specify the fold assignment used to build the inner training and testing sets for tuning the hyperparameter λ, and with type.measure='mse' we specify the metric used to evaluate the prediction performance on the inner testing sets in order to choose the best hyperparameter.

  4. The glmnet function with lambda=0 implements the OLS estimator.

It is important to point out that Lasso regression performs particularly well when a subset of the true coefficients is small or even zero. It does not perform as well when all of the true coefficients are moderately large; however, in that case it can still outperform ordinary linear regression, although only over a fairly narrow range of (small) λ values.

3.7 Logistic Regression

Logistic regression is a useful and traditional tool for explaining or predicting a binary response from the information in explanatory variables. It models the conditional distribution of the response variable as a Bernoulli distribution with probability of success given by

$$ P\left({Y}_i=1|{\boldsymbol{x}}_i\right)=p\left({\boldsymbol{x}}_i;\boldsymbol{\beta} \right)=\frac{\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)}{1+\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)}. $$

To estimate the parameters of the logistic regression, suppose that we have a set of data \( \left({\boldsymbol{x}}_{\boldsymbol{i}}^{\mathrm{T}},{y}_i\right) \), i = 1, …, n (training data), where xi = (xi1, …, xip)T is a vector of feature measurements and yi is the response measurement corresponding to the ith individual drawn. To obtain the MLE of β, first we need to build the likelihood function of β, which is given by

$$ {\displaystyle \begin{array}{c}L\left(\boldsymbol{\beta}; \boldsymbol{y}\right)=\prod \limits_{i=1}^np{\left({\boldsymbol{x}}_i;\boldsymbol{\beta}\ \right)}^{y_i}{\left[1-p\left({\boldsymbol{x}}_i;\boldsymbol{\beta} \right)\right]}^{1-{y}_i}=\prod \limits_{i=1}^n{\left(\frac{p\left({\boldsymbol{x}}_i;\boldsymbol{\beta}\ \right)}{1-p\left({\boldsymbol{x}}_i;\boldsymbol{\beta}\ \right)}\right)}^{y_i}\left[1-p\left({\boldsymbol{x}}_i;\boldsymbol{\beta} \right)\right]\\ {}=\exp \left(\sum \limits_{i=1}^n{y}_i\left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)\right)\prod \limits_{i=1}^n\frac{1}{1+\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)},\end{array}} $$

and from here the log-likelihood is

$$ \mathrm{\ell}\left(\boldsymbol{\beta}; \boldsymbol{y}\right)=\log \left[L\left(\boldsymbol{\beta}; \boldsymbol{y}\right)\right]=\sum \limits_{i=1}^n{y}_i\left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)-\sum \limits_{i=1}^n\log \left[1+\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)\right]. $$

Then, because the gradient of the log-likelihood is given by

$$ {\displaystyle \begin{array}{c}\frac{\mathrm{\partial \ell}\left(\boldsymbol{\beta}; \boldsymbol{y}\right)}{\partial \boldsymbol{\beta}}=\left[\begin{array}{c}\sum \limits_{i=1}^n{y}_i\\ {}\sum \limits_{i=1}^n{y}_i{x}_{i1}\\ {}\vdots \\ {}\sum \limits_{i=1}^n{y}_i{x}_{ip}\end{array}\right]-\left[\begin{array}{c}\sum \limits_{i=1}^n\frac{\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)}{1+\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)}\\ {}\sum \limits_{i=1}^n\frac{\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)}{1+\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)}{x}_{i1}\\ {}\vdots \\ {}\sum \limits_{i=1}^n\frac{\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)}{1+\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)}{x}_{ip}\end{array}\right]\\ {}={\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{y}-{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{p}\left(\boldsymbol{X};\boldsymbol{\beta} \right)\kern0.75em \\ {}={\boldsymbol{X}}^{\mathrm{T}}\left[\boldsymbol{y}-\boldsymbol{p}\left(\boldsymbol{X};\boldsymbol{\beta} \right)\right],\end{array}} $$

where p(X; β) = [p(x1; β), …, p(xn; β)]T, the MLE of β, \( \hat{\boldsymbol{\beta}} \), can be iteratively approximated by gradient ascent (equivalently, gradient descent applied to the negative log-likelihood):

$$ {\boldsymbol{\beta}}_{t+1}={\boldsymbol{\beta}}_t+\alpha {\boldsymbol{X}}^{\mathrm{T}}\left[\boldsymbol{y}-\boldsymbol{p}\left(\boldsymbol{X};{\boldsymbol{\beta}}_t\right)\right]. $$
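A minimal sketch of this gradient-based search (the object names and the step size alpha are illustrative choices, not part of the chapter's code) is:

# Gradient-based search for the logistic regression MLE
logit_grad_fit <- function(X, y, alpha = 0.01, n_iter = 5000) {
  X1   <- cbind(1, X)                      # add the intercept column
  beta <- rep(0, ncol(X1))
  for (t in 1:n_iter) {
    p_t  <- 1 / (1 + exp(-X1 %*% beta))    # p(x_i; beta) for all i
    beta <- beta + alpha * t(X1) %*% (y - p_t)
  }
  drop(beta)
}

Its output can be compared with coef(glm(y ~ X, family = binomial)); the step size and the number of iterations usually need some adjustment for a given data set.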

For inferential purposes, the asymptotic distribution of \( \hat{\boldsymbol{\beta}} \) is a multivariate normal distribution with mean vector β and variance–covariance matrix given by the inverse of the negative of the expected value of the Hessian of the log-likelihood, \( {\left\{-E\left[\frac{\partial^2\ell \left(\boldsymbol{\beta}; \boldsymbol{y}\right)}{\partial \boldsymbol{\beta} \partial {\boldsymbol{\beta}}^{\mathrm{T}}}\right]\right\}}^{-1}={\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{WX}\right)}^{-1} \) (McCullagh and Nelder 1989). This is because

$$ {\displaystyle \begin{array}{c}\frac{\partial^2\mathrm{\ell}\left(\boldsymbol{\beta}; \boldsymbol{y}\right)}{\partial {\beta}_j\partial {\beta}_k}=-\sum \limits_{i=1}^n\frac{\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)}{{\left[1+\exp \left({\beta}_0+{\boldsymbol{x}}_i^{\mathrm{T}}{\boldsymbol{\beta}}_0\right)\right]}^2}{x}_{ij}{x}_{ik}\\ {}=-\sum \limits_{i=1}^np\left({\boldsymbol{x}}_i;\boldsymbol{\beta} \right)\left[1-p\left({\boldsymbol{x}}_i;\boldsymbol{\beta} \right)\right]{x}_{ij}{x}_{ik}.\end{array}} $$

The Hessian of the log-likelihood is given by

$$ \frac{\partial^2\ell \left(\boldsymbol{\beta}; \boldsymbol{y}\right)}{\partial \boldsymbol{\beta} \partial {\boldsymbol{\beta}}^{\mathrm{T}}}=\boldsymbol{H}=-{\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{WX}, $$

where W = Diag{p(x1; β)[1 − p(x1; β)], …, p(xn; β)[1 − p(xn; β)]}.
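Given a fitted coefficient vector (here called beta_hat, a hypothetical name) and the feature matrix X, the asymptotic standard errors implied by this covariance matrix can be sketched as:

X1    <- cbind(1, X)                                   # design matrix with intercept
p_hat <- drop(1 / (1 + exp(-X1 %*% beta_hat)))         # fitted probabilities
W     <- diag(p_hat * (1 - p_hat))                     # weighting matrix
se    <- sqrt(diag(solve(t(X1) %*% W %*% X1)))         # asymptotic standard errors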

Once the parameters have been estimated, the predicted response is obtained from the estimated probabilities: \( {\hat{y}}_o=1 \) if \( p\left({\boldsymbol{x}}_o;\hat{\boldsymbol{\beta}}\right)>0.5 \) and \( {\hat{y}}_o=0 \) if \( p\left({\boldsymbol{x}}_o;\hat{\boldsymbol{\beta}}\right)\le 0.5 \). Of course, a threshold different from 0.5 can be used, and this threshold can be treated as a hyperparameter that needs to be tuned in a similar fashion as the penalty parameter in the Ridge procedure.

It is important to point out that the maximization of the log-likelihood can be performed with a more efficient iterative technique, the Newton–Raphson method, which uses a local quadratic approximation to the log-likelihood function. The following is the Newton–Raphson iterative equation used to search for the beta coefficients:

$$ {\boldsymbol{\beta}}_{t+1}={\boldsymbol{\beta}}_t-{\boldsymbol{H}}^{-\mathbf{1}}{\boldsymbol{X}}^{\mathrm{T}}\left[\boldsymbol{y}-\boldsymbol{p}\left(\boldsymbol{X};{\boldsymbol{\beta}}_t\right)\right], $$

where H =  − XTWX is the Hessian matrix, whose elements are the second derivatives of the log-likelihood with respect to the beta coefficients. Therefore, the inverse of the Hessian is \( {\boldsymbol{H}}^{-1}=-{\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{WX}\right)}^{-1} \), so the previous equation can be expressed as

$$ {\boldsymbol{\beta}}_{t+1}={\boldsymbol{\beta}}_t+{\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{WX}\right)}^{-\mathbf{1}}{\boldsymbol{X}}^{\mathrm{T}}\left[\boldsymbol{y}-\boldsymbol{p}\left(\boldsymbol{X};{\boldsymbol{\beta}}_t\right)\right]. $$

It is important to recall that the Hessian is no longer constant since it depends on β through the weighting matrix W. It is also clear that logistic regression does not have a closed-form solution due to the nonlinearity of the logistic sigmoid function. If instead of maximizing the log-likelihood we minimize the negative of the log-likelihood, the Newton–Raphson equation for updating the beta coefficients becomes

$$ {\boldsymbol{\beta}}_{t+1}={\boldsymbol{\beta}}_t-{\left({\boldsymbol{X}}^{\mathrm{T}}\boldsymbol{WX}\right)}^{-\mathbf{1}}{\boldsymbol{X}}^{\mathrm{T}}\left[\boldsymbol{p}\left(\boldsymbol{X};{\boldsymbol{\beta}}_t\right)-\boldsymbol{y}\right], $$

which coincides with the update given above. This Newton–Raphson algorithm for logistic regression is known as iteratively reweighted least squares, since the diagonal elements of the weighting matrix W can be interpreted as variances. An alternative method for estimating the beta coefficients in logistic regression is the Fisher scoring method, which is very similar to the Newton–Raphson method just described, except that instead of the Hessian (H), it uses the expected value of the Hessian matrix, E(H).
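A minimal sketch of this Newton–Raphson/IRLS scheme (object names are illustrative; with perfectly separable data the weights can degenerate and the iterations may fail) is:

# Newton-Raphson / IRLS for logistic regression
logit_irls_fit <- function(X, y, n_iter = 25) {
  X1   <- cbind(1, X)
  beta <- rep(0, ncol(X1))
  for (t in 1:n_iter) {
    p_t  <- drop(1 / (1 + exp(-X1 %*% beta)))          # current probabilities
    W    <- diag(p_t * (1 - p_t))                      # current weights
    beta <- beta + solve(t(X1) %*% W %*% X1, t(X1) %*% (y - p_t))
  }
  drop(beta)
}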

3.7.1 Logistic Ridge Regression

As in the MLR, when there is strong collinearity, the variance of the MLE is severely affected and the true effects of the explanatory variables can be falsely identified (Lee and Silvapulle 1988). In a similar fashion as for the MLR, this can be judged directly from the asymptotic covariance matrix of \( \hat{\boldsymbol{\beta}} \). Moreover, in a common prediction context, when the number of features is larger than the number of observations (p ≫ n), the design matrix is not of full column rank, which can cause overfitting and inflate the expected classification error (generalization error) of the “MLE.” One way to avoid overfitting is to replace the MLE with a regularized MLE, analogous to the Ridge estimator of the MLR. This is defined as

$$ {\tilde{\boldsymbol{\beta}}}_s^R\left(\lambda \right)=\underset{{\boldsymbol{\beta}}_s}{\mathrm{argmax}}\left[\mathrm{\ell}\left({\boldsymbol{\beta}}_s;\boldsymbol{y}\ \right)-\lambda \sum \limits_{j=1}^p{\beta}_{js}^2\right], $$

where λ is a hyperparameter that has a similar interpretation as in the MLR.

In the literature, there are several algorithms that approximate the Ridge estimate. For example, Genkin et al. (2007) used a cyclic coordinate descent optimization algorithm, in which the one-dimensional optimization problem involved is solved by a modified Newton–Raphson method. Another method was proposed by Friedman et al. (2008) in a more general context. Given the current values \( {\overset{\sim }{\boldsymbol{\beta}}}_s\left(\lambda \right) \), the next update of coordinate βks is given by

$$ {\beta}_{ks}=\frac{\sum_{i=1}^n{w}_i{y}_{ik}^{\ast }{x}_{iks}}{\sum_{i=1}^n{w}_i{x}_{iks}^2+\lambda } $$

with \( {y}_{ik}^{\ast }={y}_i^{\ast }-\overset{\sim }{\mu}\left(\lambda \right)-{\sum}_{\overset{j=1}{j\ne k}}^p{x}_{ij s}{\overset{\sim }{\beta}}_{js}\left(\lambda \right) \) for k = 1, …, p, and the update of μ is given by

$$ \mu =\frac{\sum_{i=1}^n{w}_i{e}_i^{\ast }}{\sum_{i=1}^n{w}_i} $$

with \( {e}_i^{\ast }={y}_i^{\ast }-{\sum}_{j=1}^p{x}_{ijs}{\overset{\sim }{\beta}}_{js}\left(\lambda \right) \), where \( {y}_i^{\ast }={\overset{\sim }{\beta}}_0\left(\lambda \right)+{\boldsymbol{x}}_i^{\mathrm{T}}{\overset{\sim }{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)+\frac{y_i-p\left({\boldsymbol{x}}_i;{\overset{\sim }{\boldsymbol{\beta}}}_s\left(\lambda \right)\right)}{w_i} \) and \( {w}_{\boldsymbol{i}}=p\left({\boldsymbol{x}}_i;{\overset{\sim }{\boldsymbol{\beta}}}_s\left(\lambda \right)\right)\left[1-p\left({\boldsymbol{x}}_i;{\overset{\sim }{\boldsymbol{\beta}}}_s\left(\lambda \right)\right)\right] \), i = 1, …, n, are pseudo responses and weights that change across the updates. These updates can be obtained by maximizing, with respect to βks, the following quadratic approximation of the penalized log-likelihood at the current values of βs (\( {\overset{\sim }{\boldsymbol{\beta}}}_s\left(\lambda \right) \)):

$$ \mathrm{\ell}\left({\boldsymbol{\beta}}_s;\boldsymbol{y}\right)-\frac{\lambda }{2}\sum \limits_{j=1}^p{\beta}_{js}^2\approx {\mathrm{\ell}}^{\ast}\left({\boldsymbol{\beta}}_s;\boldsymbol{y}\right)-\frac{\lambda }{2}\sum \limits_{j=1}^p{\beta}_{js}^2+c, $$

where \( {\ell}^{\ast}\left({\boldsymbol{\beta}}_s;\boldsymbol{y}\right)=-\frac{1}{2}{\sum}_{i=1}^n{w}_i{\left({y}_i^{\ast }-\mu -{\sum}_{j=1}^p{x}_{ijs}{\beta}_{js}\right)}^2 \) is the quadratic approximation of (βs; y) at current values of βs, \( {\overset{\sim }{\boldsymbol{\beta}}}_s\left(\lambda \right) \), and c is a constant that does not depend on βs. More details of this implementation (glmnet R package) can be found in Friedman et al. (2008). Note that under this approximation, the beta coefficients can be updated by

$$ {\hat{\boldsymbol{\beta}}}_s^R\left(\lambda \right)={\left({\boldsymbol{X}}_s^{\mathrm{T}}\boldsymbol{W}{\boldsymbol{X}}_s+\lambda \boldsymbol{D}\right)}^{-1}{\boldsymbol{X}}_s^{\mathrm{T}}\boldsymbol{W}{\boldsymbol{y}}^{\ast }, $$

where W = Diag(w1, …, wn), \( {\boldsymbol{y}}^{\ast }={\left({y}_1^{\ast },\dots, {y}_n^{\ast}\right)}^{\mathrm{T}} \) is the vector of pseudo responses, and D is the diagonal penalty matrix as defined before.
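The update above can be sketched as a penalized IRLS loop. This is a rough illustration for a given λ; the standardization and the scaling of λ may differ from the glmnet implementation, and all names are hypothetical:

# Penalized IRLS update for logistic Ridge regression
logit_ridge_fit <- function(X, y, lambda, n_iter = 50) {
  Xs   <- scale(X)                          # standardized features
  X1   <- cbind(1, Xs)
  D    <- diag(c(0, rep(1, ncol(Xs))))      # the intercept is not penalized
  beta <- rep(0, ncol(X1))
  for (t in 1:n_iter) {
    eta    <- drop(X1 %*% beta)
    p_t    <- 1 / (1 + exp(-eta))
    w      <- p_t * (1 - p_t)               # weights
    y_star <- eta + (y - p_t) / w           # working (pseudo) responses
    beta   <- solve(t(X1) %*% (w * X1) + lambda * D,
                    t(X1) %*% (w * y_star))
  }
  drop(beta)
}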

3.7.2 Lasso Logistic Regression

The Lasso penalization can be applied to other models (Tibshirani 1996). In particular, for logistic regression, the Lasso estimator of βs is defined as

$$ {\overset{\sim }{\boldsymbol{\beta}}}_s^L\left(\lambda \right)=\underset{{\boldsymbol{\beta}}_s}{\mathrm{argmax}}{\ell}_L\left({\boldsymbol{\beta}}_s;\boldsymbol{y}\right), $$

where \( {\ell}_L\left({\boldsymbol{\beta}}_s;\boldsymbol{y}\right)=\ell \left({\boldsymbol{\beta}}_s;\boldsymbol{y}\right)-\lambda {\sum}_{j=1}^p\left|{\beta}_{js}\right| \) is often known as the regularized Lasso likelihood. Numerical methods are also required to obtain this Lasso estimate. There are several possibilities (Genkin et al. 2007), such as non-quadratic programming and iteratively reweighted least squares (Tibshirani 1996), but here we briefly describe the one proposed by Friedman et al. (2008) and implemented in the glmnet R package. This method applies the coordinate descent procedure to a penalized reweighted least squares problem, which is formed by making a Taylor approximation to the log-likelihood around the current values of the coefficients. That is, the procedure consists of successively updating the parameters by

$$ {\tilde{\boldsymbol{\beta}}}_s=\underset{{\boldsymbol{\beta}}_{\boldsymbol{s}}}{\mathrm{argmin}}\left(-{\mathrm{\ell}}^{\ast}\left({\boldsymbol{\beta}}_s;\boldsymbol{y}\right)+\frac{\lambda }{2}\sum \limits_{j=1}^p\left|{\beta}_{js}\right|\right), $$

where \( {\ell}^{\ast}\left({\boldsymbol{\beta}}_s;\boldsymbol{y}\right)=-\frac{1}{2}{\sum}_{i=1}^n{w}_i{\left({y}_i^{\ast }-\mu -{\sum}_{j=1}^p{x}_{ijs}{\beta}_{js}\right)}^2+c, \)\( {y}_i^{\ast }={\overset{\sim }{\beta}}_0\left(\lambda \right)+{\boldsymbol{x}}_i^{\mathrm{T}}{\overset{\sim }{\boldsymbol{\beta}}}_{0s}\left(\lambda \right)+\frac{y_i-p\left({\boldsymbol{x}}_i;{\overset{\sim }{\boldsymbol{\beta}}}_s\left(\lambda \right)\right)}{w_i}, \) and \( {w}_{\boldsymbol{i}}=p\left({\boldsymbol{x}}_i;{\overset{\sim }{\boldsymbol{\beta}}}_s\left(\lambda \right)\right)\left[1-p\left({\boldsymbol{x}}_i;{\overset{\sim }{\boldsymbol{\beta}}}_s\left(\lambda \right)\right)\right] \), i = 1, …, n, as defined in the Ridge regression case. More details of this implementation can be consulted in Friedman et al. (2008).

Example 5

In this example, we used data corresponding to 40 lines planted in four repetitions. For illustrative purposes, we use as response a binary variable based on plant height. The matrix of features was obtained from the genomic relationship matrix (G) as X = ZLG1/2, where G1/2 is the square root matrix of G and ZL is the incidence matrix connecting observations with lines.

The performance of logistic regression, logistic Ridge regression, and Lasso logistic regression for this data set was evaluated across 100 random splittings of the complete data set: 20% for testing (performance evaluation) and 80% for training. The performance was measured by the proportion of cases correctly classified (PCCC) in the testing data. These results are summarized in Table 3.2, where for each method the mean (PCCC) and standard deviation (SD) of the PCCC across the 100 splittings are reported. The table indicates that, on average, the standard logistic regression (SLR) approach shows slightly better performance than the other two approaches, including the Lasso solution. Out of the 100 random partitions, the SLR, the logistic Ridge regression (LRR), and the logistic Lasso regression (LLR) gave the highest PCCC value in 72, 24, and 4 partitions, respectively. However, the differences in performance among the three methods are not significant because of the large variation observed across partitions.

Table 3.2 Performance of the standard, Ridge, and Lasso logistic regression models

The computations were done with the help of the glmnet R package using the following R code:

#########################R code for Example 5#######################
load(file ='dat-E3.5.RData')
dat_F = dat$dat_F
dat_F = dat_F[order(dat_F$Rep,dat_F$GID),]
head(dat_F)
G = dat$G
dat_F$y = dat_F$Height
ZL = model.matrix(~0+GID,data=dat_F)
colnames(ZL)
Pos = match(colnames(ZL),paste('GID',colnames(G),sep=''))
max(abs(diff(Pos)))
y = dat_F$y
ei = eigen(G)
X = ZL%*%ei$vectors%*%diag(sqrt(ei$values))%*%t(ei$vectors)
n = length(y)
library(glmnet)
#100 random partitions (80% training, 20% testing)
set.seed(1)
Tab = data.frame()
set.seed(1)
for(k in 1:100)
{
  Pos_tr = sample(1:n,n*0.8)
  y_tr = y[Pos_tr] ; X_tr = X[Pos_tr,]; n_tr = dim(X_tr)[1]
  y_tst = y[-Pos_tr]; X_tst = X[-Pos_tr,]
  #Partition for internal training the model
  Grpv_k = findInterval(cut(sample(1:n_tr,n_tr),breaks=5),1:n_tr)
  #RR
  A_RR = cv.glmnet(X_tr,y_tr,family='binomial',
                   alpha=0,foldid=Grpv_k,type.measure='class')
  yp_RR = as.numeric(predict(A_RR,newx=X_tst,s='lambda.min',type='class'))
  #LR
  A_LR = cv.glmnet(X_tr,y_tr,family='binomial',
                   alpha=1,foldid=Grpv_k,type.measure='class')
  yp_LR = as.numeric(predict(A_LR,newx=X_tst,s='lambda.min',type='class'))
  #SLR
  A_SLR = glmnet(X_tr,y_tr,family='binomial',alpha=0,lambda=0)
  yp_SLR = as.numeric(predict(A_SLR,newx=X_tst,type='class'))
  Tab = rbind(Tab,data.frame(PT=k,PCCC_RR = 1-mean(y_tst!=yp_RR),
                             PCCC_LR = 1-mean(y_tst!=yp_LR),
                             PCCC_SLR = 1-mean(y_tst!=yp_SLR)))
  cat('k = ', k,'\n')
}
Tab

Also, in this R code there are four relevant points:

  1. One hundred random partitions were implemented, each obtained with Pos_tr = sample(1:n,n*0.8), so that 80% of the data is used for training and 20% for testing; for each training set, an inner K=5 fold CV is performed to tune the λ hyperparameter.

  2. As before, Grpv_k contains the fold assignments for the inner K=5 fold CV used to tune the hyperparameter λ.

  3. The cv.glmnet function is used with the following input: (a) the training set (X_tr, y_tr); (b) alpha=0 to implement Ridge regression and alpha=1 to implement Lasso regression; (c) foldid=Grpv_k, giving the fold assignment used to build the inner training and testing sets for tuning the hyperparameter λ; (d) family='binomial' to implement a logistic regression; and (e) type.measure='class', a metric for categorical data used to measure the prediction performance on the inner testing sets and choose the best value of the hyperparameter λ.

  4. The glmnet function with family='binomial' and lambda=0 implements logistic regression without penalization.