Building Supervised Statistical Machine Learning Models

A linear multiple regression model (LMRM) is a useful tool for investigating linear relationships between two or more explanatory variables (inputs, or features in the machine learning literature) (X) and the conditional expected value of a response, E(Y|X). Due to its simplicity, adequate fit, and easily interpretable results, it has been one of the most popular techniques for studying the association between variables. It is also a useful approach and an ideal (natural) starting point for studying more advanced methods of association and prediction (James et al. 2013). In this chapter, we review the main concepts and approaches for fitting a linear regression model.


Fitting a Linear Multiple Regression Model via the Ordinary Least Squares (OLS) Method
In a general context, we have a covariate vector X = (X_1, ..., X_p)^T and we want to use this information to predict or explain how these variables affect a real-valued response Y. The linear multiple regression model assumes a relationship given by

Y = β_0 + Σ_{j=1}^p X_j β_j + ε,    (3.1)

where ε is a random error with mean 0, E(ε) = 0, and is independent of X. This error is included in the model to capture measurement errors and the effects of other unregistered explanatory variables that can help to explain the mean response.
Then the conditional mean of this model is E(Y|X) = β_0 + Σ_{j=1}^p X_j β_j, and the conditional distribution of Y given X is only affected by the information in X.
To estimate the parameters β = (β_0, β_1, ..., β_p)^T, we usually have a set of data (x_i^T, y_i), i = 1, ..., n, often known as training data, where x_i = (x_{i1}, ..., x_{ip})^T is a vector of feature measurements and y_i is the response measurement corresponding to the ith individual drawn. The most common method for estimating β is the ordinary least squares (OLS) method, which consists of taking the value of β that minimizes the residual sum of squares defined as

RSS(β) = Σ_{i=1}^n (y_i − β_0 − Σ_{j=1}^p x_{ij} β_j)^2 = (y − Xβ)^T (y − Xβ),

where y = (y_1, ..., y_n)^T is the vector with the response values of all individuals, and X is an n × (p + 1) matrix that contains the information of the measured features of all individuals, including a 1 in the first entry of each row for the intercept. If the X matrix has full column rank, then by differentiating the residual sum of squares with respect to the β coefficients, we can find the set of β parameters that minimizes RSS(β):

∂RSS(β)/∂β = −2X^T(y − Xβ).

This derivative is also known as the gradient of the residual sum of squares. Then, by setting the gradient of the residual sum of squares to zero, we obtain the normal equations

X^T X β = X^T y.

The solution to the normal equations is unique and gives the OLS estimator of β,

β̂ = (X^T X)^{−1} X^T y,

where the superindex −1 indicates matrix inversion. From the above assumptions, we can show that this estimator is unbiased, and with the additional assumption that the observed responses y_i are uncorrelated and have the same variance, Var(y_i) = σ^2, we can also show that the variance-covariance matrix of this estimator is

Var(β̂) = σ^2 (X^T X)^{−1}.

When the input features contain the information of only one variable (p = 1), the resulting model is known as simple linear regression and can be easily visualized in the Cartesian plane. When p = 2, the above multiple linear regression model describes a plane in the three-dimensional space (x_1, x_2, y). In general, the conditional expected value of this model defines a hyperplane in the p-dimensional space of the input variables (Montgomery et al. 2012).
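As a quick numerical illustration of the normal equations, the following minimal Python sketch (with a small hypothetical data set) computes the OLS estimator both by solving X^T X β = X^T y directly and via a least-squares routine:

```python
import numpy as np

# Hypothetical training data: n = 5 individuals, p = 2 features;
# the first column of X is the intercept column of 1's.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.5, 0.8],
              [1.0, 3.0, 2.2],
              [1.0, 2.5, 1.1],
              [1.0, 4.0, 3.5]])
y = np.array([5.0, 4.0, 7.0, 6.0, 9.0])

# Solve the normal equations X^T X beta = X^T y (X of full column rank)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer equivalent route
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Both routes give the same estimate; `lstsq` is generally preferred in practice because it avoids forming X^T X explicitly.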
The fitted values corresponding to all the training individuals are

ŷ = Xβ̂ = X(X^T X)^{−1} X^T y = Hy,

where the matrix H = X(X^T X)^{−1} X^T is commonly called the hat matrix, because this expression maps the vector of observed response values to a vector of fitted values (Montgomery et al. 2012); in this way, it "puts the hat" on y (Hastie et al. 2009). In a similar way, a predicted value for an arbitrary individual with feature vector x can be obtained by

ŷ_0 = x*^T β̂,  with x* = (1, x^T)^T.

An unbiased estimator of the common residual variance σ^2 is obtained by

σ̂^2 = (1/(n − p − 1)) Σ_{i=1}^n e_i^2 = (1/(n − p − 1)) e^T e,

where e_i = y_i − ŷ_i is known as the residual of the model corresponding to individual i, e = y − ŷ = (I_n − H)y is the vector of all residual values, and I_n is the identity matrix of order n × n.
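The hat matrix, residuals, and the unbiased variance estimator can be sketched as follows (a minimal Python sketch with simulated data and assumed true coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0])        # assumed true coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix: puts the hat on y
y_hat = H @ y                                  # fitted values
e = y - y_hat                                  # residuals
sigma2_hat = e @ e / (n - p - 1)               # unbiased estimator of sigma^2

# H is a symmetric projection matrix with trace p + 1
assert np.allclose(H @ H, H)
assert np.isclose(np.trace(H), p + 1)
```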
The traditional inferential and prediction analysis for this model assumes that the random error ε is normally distributed with mean zero and variance σ^2. With this, we can show that the OLS estimator of the beta coefficients, β̂, is a random vector distributed according to a multivariate normal distribution with mean vector β and the variance-covariance matrix previously defined (Montgomery et al. 2012; Hastie et al. 2009; Rencher and Schaalje 2008). Another important fact, which will be described in more detail in the next section, is that under the Gaussian assumption on the errors, the OLS estimator of β coincides with the maximum likelihood estimator.
We can also show that (n − p − 1)σ̂^2/σ^2 is independent of β̂ and distributed according to a Chi-squared distribution with n − p − 1 degrees of freedom. Based on this and on the properties of the normal and Student's t distributions, we can show that for j = 0, 1, ..., p,

T_j = (β̂_j − β_j)/(σ̂ √c_jj),

where c_jj is the (j + 1, j + 1) element of the matrix (X^T X)^{−1}, are random variables with a Student's t distribution with n − p − 1 degrees of freedom (t_{n−p−1}). That is, T_j ~ t_{n−p−1}, where ~ stands for "distributed as". From here, a 100(1 − α)% confidence interval for a particular beta coefficient, β_j, is given by

β̂_j ± t_{1−α/2, n−p−1} σ̂ √c_jj,

where t_{1−α/2, n−p−1} is the 1 − α/2 quantile of the Student's t distribution with n − p − 1 degrees of freedom. Similarly, a 100(1 − α)% joint confidence region for all the beta coefficients, β, is given by the values of β that satisfy

(β̂ − β)^T X^T X (β̂ − β) / ((p + 1) σ̂^2) ≤ F^{p+1}_{1−α, n−p−1},

where F^{p+1}_{1−α, n−p−1} denotes the 1 − α quantile of the F distribution with p + 1 and n − p − 1 degrees of freedom in the numerator and denominator, respectively (Rencher and Schaalje 2008).
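These interval formulas translate directly into code. A minimal Python sketch (simulated data; the quantile t_{0.975,27} ≈ 2.052 is taken from a t table rather than computed) is:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
sigma_hat = np.sqrt(e @ e / (n - p - 1))       # residual standard deviation

# 95% CI for each beta_j: beta_hat_j +/- t * sigma_hat * sqrt(c_jj)
t_quant = 2.052                                # t_{0.975, 27}, from a t table
se = sigma_hat * np.sqrt(np.diag(XtX_inv))     # standard errors
ci = np.column_stack([beta_hat - t_quant * se, beta_hat + t_quant * se])
```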
In a similar way, a test of the hypothesis H_0: β_j = 0 against H_1: β_j ≠ 0 for a specific beta coefficient can be performed using the following rule: reject H_0 at significance level α if |T_j| = |β̂_j|/(σ̂ √c_jj) > t_{1−α/2, n−p−1}.

Fitting the Linear Multiple Regression Model via the Maximum Likelihood (ML) Method
The maximum likelihood (ML) method is a more general and popular approach for estimating the parameters of a model (Casella and Berger 2002). It consists of finding the parameter values that maximize the "probability" of the values observed in the sample under the adopted model.
Assuming that (x_i^T, y_i), i = 1, ..., n, is a set of observations from a multiple linear regression model (3.1) with homoscedastic and uncorrelated errors, the MLEs of β and σ^2, β̂ and σ̂^2, are defined as

(β̂, σ̂^2) = argmax_{β, σ^2} L(β, σ^2; y, X),

where L(β, σ^2; y, X) is the likelihood function of the parameters, which is the probability (density) of the observed response values but viewed as a function of the parameters:

L(β, σ^2; y, X) = (2πσ^2)^{−n/2} exp[−(1/(2σ^2)) (y − Xβ)^T (y − Xβ)].

Then the log-likelihood, log(L(β, σ^2; y, X)), is equal to

log L(β, σ^2; y, X) = −(n/2) log(2π) − n log(σ) − (1/(2σ^2)) (y − Xβ)^T (y − Xβ).

To find the maximizers over σ^2 and β, we take the derivatives of log L(β, σ^2; y, X) with respect to these parameters:

∂ log L/∂β = (1/σ^2) X^T (y − Xβ),
∂ log L/∂σ = −n/σ + (1/σ^3) (y − Xβ)^T (y − Xβ).

Now, by setting these derivatives equal to zero and solving the resulting equations for β and σ^2, we find that the estimates of these parameters are

β̂ = (X^T X)^{−1} X^T y  and  σ̂^2 = (1/n) (y − Xβ̂)^T (y − Xβ̂).

From this we can see that for each value of σ^2, the value of β that maximizes the likelihood is the value that maximizes −(1/(2σ^2))(y − Xβ)^T(y − Xβ), which in turn minimizes (y − Xβ)^T(y − Xβ); this is precisely the OLS estimator of β, β̂. Then, equating the derivative of log L(β̂, σ^2; y, X) with respect to σ to zero and solving for σ^2 gives σ̂^2 = (1/n)(y − Xβ̂)^T(y − Xβ̂). From here, the MLEs of β and σ^2 are β̂ and σ̂^2, because it can be shown that the parameter values that maximize the likelihood are unique when the design matrix X is of full column rank.
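To make the link between the two estimators concrete, the following Python sketch (simulated data with assumed coefficients) computes the ML estimates and checks numerically that perturbing β̂ never increases the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS = ML estimate of beta
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_ml = rss / n                            # ML estimate (biased)
sigma2_unb = rss / (n - p - 1)                 # unbiased estimator (OLS section)

def loglik(beta, sigma2):
    r = y - X @ beta
    return -n / 2 * np.log(2 * np.pi) - n / 2 * np.log(sigma2) - r @ r / (2 * sigma2)

# For fixed sigma^2 the log-likelihood is concave in beta, maximized at beta_hat
best = loglik(beta_hat, sigma2_ml)
for _ in range(100):
    assert loglik(beta_hat + rng.normal(scale=0.1, size=p + 1), sigma2_ml) <= best
```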

Fitting the Linear Multiple Regression Model via the Gradient Descent (GD) Method
The steepest descent method, also known as the gradient descent (GD) method, is a first-order iterative algorithm for minimizing a function f. It is a central mechanism in statistical learning for training models (estimating the parameters), for example, in neural networks and penalized regression models (Ridge and Lasso). It consists of successively updating the argument of the objective function in the direction of steepest descent (along the negative of the gradient of the function), that is, in the direction in which f decreases most rapidly (Haykin 2009; Nocedal and Wright 2006). Specifically, each step of this algorithm is described by

η_{t+1} = η_t − α ∇f(η_t),

where ∇f(η_t) is the gradient vector of f evaluated at the current value η_t and α is a step-size or learning-rate parameter, which greatly determines the convergence behavior toward an optimal solution (Haykin 2009; Beysolow II 2017); in neural networks it is popular to set this at a small, fixed value (Warner and Misra 1996; Goodfellow et al. 2016). The learning-rate parameter can be adaptive as well, that is, it can be allowed to change at each step. For example, the Keras library (see Chap. 11), which can be used for implementing and training neural network models, offers several optimizers based on adaptive gradient descent algorithms, such as Adam, Adagrad, Adadelta, and RMSprop, among others (Allaire and Chollet 2019). The ideal value of the step size would be the value that gives the largest reduction in each step, that is, the value of α that minimizes f(η_t − α∇f(η_t)), which in general is difficult and expensive to obtain (Nocedal and Wright 2006). Although the use of this algorithm could be avoided in an MLR, especially for small data sets, and also because of its slow convergence in linear systems (Burden and Faires 2011), here we will describe how it works when finding the optimal beta coefficients in this model.
First, the gradient of the residual sum of squares is given by

∇RSS(β) = −2X^T(y − Xβ).

Then, the next update of the beta coefficients in the gradient descent algorithm for this model is given by

β_{t+1} = β_t − α∇RSS(β_t) = β_t + 2αX^T e_t,

where e_t = y − Xβ_t is the vector of residuals obtained in the current iteration. One way to speed up the convergence of the algorithm is by choosing the ideal learning rate in each step which, as described before, is given by the value of α that minimizes f(η_t − α∇f(η_t)); for the MLR model this is given by (Nocedal and Wright 2006):

α_t = ∇RSS(β_t)^T ∇RSS(β_t) / (2 ∇RSS(β_t)^T X^T X ∇RSS(β_t)).

Example 1. For numerical illustration, we considered a synthetic data set that consists of 100 observations and two covariates. The scatter plots in Fig. 3.1 show how the response variable (y) is related to the two covariates (x_1, x_2). By setting a value of 10^{−2} for the learning-rate parameter, and as the stopping criterion a tolerance of 10^{−8} for the maximum norm of the difference between the current and next vector value, the beta coefficients obtained with the GD method are β̂ = (5.0460764, 0.8551383, 2.1903356)^T. For this synthetic example, 12 iterations were necessary, while by changing the learning-rate parameter to 10^{−3}, the number of iterations increased to 185, but we obtained practically the same results. Now, by using the "optimal" learning-rate parameter described before for the MLR with the same tolerance error (10^{−8}), the number of iterations required for convergence is reduced to only 10. In general, the performance of gradient descent depends greatly on the objective function and can be affected by the characteristics of the model, the dispersion of the data (explained variance of the predictors), and the dependence between the predictors, among others.
In the data set used in this example, the covariates are independent and the proportion of variance explained by the predictors is about 79% of the total variance of the response. By changing to a pair of moderately correlated covariates with correlation 0.75, while holding the same beta coefficient values, the same residual variance (1.44), and the same sample size, we generated data where a greater proportion of variance is explained by the covariates (85.6%); but when applying gradient descent with the same tolerance error (10^{−8}) as before, and learning-rate values of 10^{−2} and 10^{−3}, the required numbers of iterations are about 5.75 times (69) and 3.5 times (649) the numbers required in the independent-covariates example described before, respectively.
Continuing with the last case of dependent variables, when using the optimal learning rate described before for the MLR, the number of iterations is reduced to 60, 9 fewer than when using the constant learning rate 10^{−2}.
By multiplying the beta coefficients used before by sqrt(0.1), the proportion of variance explained by the covariates is reduced to 27.30% and 37.2% in the same independent-covariate (E3) and correlated-covariate (E4) scenarios described before, respectively. With a tolerance error of 10^{−8} and a learning rate equal to 10^{−2}, the required numbers of iterations are 183 and 66 for scenarios E3 and E4, respectively, while for a learning rate of 10^{−3} the required numbers of iterations are 617 and 1638. When using the "optimal" learning-rate parameter, the required number of iterations is reduced to 17 and 56 for scenarios E3 and E4, respectively. The R code used for implementing the GD method is given next.
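A minimal sketch of this GD procedure (here in Python with hypothetical synthetic data; the chapter's own implementation is in R) compares a fixed learning rate with the "optimal" exact line-search step:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([5.0, 0.85, 2.2]) + rng.normal(scale=1.2, size=n)

def gd_mlr(X, y, alpha=None, tol=1e-8, max_iter=100000):
    """Gradient descent for RSS(beta); alpha=None uses the exact line search."""
    beta = np.zeros(X.shape[1])
    for it in range(1, max_iter + 1):
        g = -2 * X.T @ (y - X @ beta)                    # gradient of the RSS
        step = alpha if alpha is not None else (g @ g) / (2 * g @ (X.T @ (X @ g)))
        beta_new = beta - step * g
        if np.max(np.abs(beta_new - beta)) < tol:        # max-norm stopping rule
            return beta_new, it
        beta = beta_new
    return beta, max_iter

beta_fixed, it_fixed = gd_mlr(X, y, alpha=1e-3)          # fixed learning rate
beta_opt, it_opt = gd_mlr(X, y)                          # exact line search
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(beta_fixed, beta_ols, atol=1e-5)
assert np.allclose(beta_opt, beta_ols, atol=1e-5)
```

Both runs reach the OLS solution; the line-search version typically needs far fewer iterations, mirroring the behavior reported in the example.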

Advantages and Disadvantages of Standard Linear Regression Models (OLS and MLR)
The MLR is a simple and computationally appealing class of models, but with many predictors (relative to the sample size) or nearly dependent features, it may result in large prediction errors and/or large predictive intervals (Wakefield 2013). To appreciate the latter case (nearly dependent features), consider the spectral decomposition

X^T X = Γ Λ Γ^T,

where Λ is a diagonal matrix with the eigenvalues of X^T X in decreasing order and Γ is an orthogonal matrix whose columns are the corresponding eigenvectors of X^T X. Then the variance-covariance matrix of the OLS estimator β̂ can be expressed as

Var(β̂) = σ^2 (X^T X)^{−1} = σ^2 Γ Λ^{−1} Γ^T.

When the features are nearly dependent, some λ_j's will be "close" to zero and consequently the variances of some β̂_j's will be high; this is even greater when the linear dependence of the features is strong (Wakefield 2013; Christensen 2011). This strong dependence between features is a problem of the OLS in MLR that is also reflected in the quality of the prediction performance, for example, when this is measured by the conditional expected prediction error (EPE), or mean squared error of prediction, which for an individual with feature x_o is given by

EPE(x_o) = E[(Y_0 − x*_o^T β̂)^2 | x_o] = σ^2 + σ^2 x*_o^T (X^T X)^{−1} x*_o,

with x*_o = (1, x_o^T)^T. This means that the average loss incurred (squared difference between the value to be predicted and the predicted value) by predicting Y_0 with its estimated mean under the MLR, x*_o^T β̂, is composed of intrinsic or irreducible data noise (first term) and the variance of x*_o^T β̂ (second term). The former cannot be avoided no matter how well the mean value of Y_0 | x_0, E(Y_0 | x_0), is estimated, while the latter increases as the dependence between features becomes stronger. From this, it is apparent that the EPE is also affected by strong dependence between features, which is a problem for the OLS in an MLR in a prediction context.
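This eigenvalue argument can be checked numerically. The sketch below (Python, simulated data with a nearly dependent second feature) shows a near-zero eigenvalue of X^T X and the resulting inflation of coefficient variances:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = 0.99 * x1 + 0.1 * rng.normal(size=n)     # nearly dependent feature
X = np.column_stack([np.ones(n), x1, x2])

lam = np.linalg.eigvalsh(X.T @ X)             # eigenvalues, ascending order
sigma2 = 1.0                                  # assumed error variance
var_beta = sigma2 * np.linalg.inv(X.T @ X)    # = sigma^2 Gamma Lambda^{-1} Gamma^T

# One eigenvalue is "close" to zero, so some coefficient variances blow up
assert lam[0] < 5.0
assert np.max(np.diag(var_beta)) > 0.05
```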

Ridge Regression
Ridge regression, originally proposed as a method to combat multicollinearity, is also a common approach for controlling overfitting in an MLR model (Christensen 2011). It translates the OLS problem into the minimization of the penalized residual sum of squares defined as

PRSS_λ(β) = (y − Xβ)^T (y − Xβ) + λ Σ_{j=1}^p β_j^2,

where λ ≥ 0 is known as the regularization or tuning parameter, which determines the level or degree to which the beta coefficients are shrunk toward zero. When λ = 0, the OLS estimate is the solution for the beta coefficients, but when λ is large, PRSS_λ(β) is dominated by the penalization term, and the OLS solution has to shrink toward 0 (Christensen 2011). In general, when the number of parameters to be estimated is larger than the number of observations, the estimator can be highly variable. In this situation, the intuition behind Ridge regression is to alleviate this by constraining the sum of squares of the beta coefficients. Note that PRSS_λ(β) can be expressed as

PRSS_λ(β) = (y − Xβ)^T (y − Xβ) + λ β^T D β,

where D = diag(0, 1, ..., 1) is the identity matrix of dimension (p + 1) × (p + 1) but with a zero in its first entry, so that the intercept is not penalized. Then the gradient of PRSS_λ(β), that is, its first derivative with respect to β, is

∇PRSS_λ(β) = −2X^T(y − Xβ) + 2λDβ,

and setting it to zero gives the Ridge estimator

β̂^R(λ) = (X^T X + λD)^{−1} X^T y.

This is a biased estimator of β because its conditional expected value is given by

E[β̂^R(λ) | X] = (X^T X + λD)^{−1} X^T X β,

but as will be described later, relative to the OLS estimator, by introducing a "small" bias, the variance and/or the EPE of this method can potentially be reduced (Wakefield 2013). By using the method of Lagrange multipliers, the Ridge regression estimation of the β coefficients can be reformulated in a way similar to the OLS problem, but subject to the condition that the magnitude of β_0 = (β_1, ..., β_p)^T be bounded:

minimize (y − Xβ)^T(y − Xβ) subject to Σ_{j=1}^p β_j^2 ≤ t(λ),

where t(λ) is a one-to-one function that produces a definition equivalent to the penalized OLS presentation of Ridge regression described before (Wakefield 2013; Hastie et al. 2009, 2015).
This constrained reformulation makes the role played by the tuning parameter more transparent and, among other things, suggests a convenient and common way of redefining the Ridge estimator by standardizing the variables when these are on very different scales. A graphic representation of this constrained problem for β_0 = 0 and p = 2 is given in Fig. 3.2, where the nested ellipsoids correspond to contour plots of RSS(β) and the green region is the restriction with t(λ) = 3/2, which contains the Ridge solution.
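The closed-form Ridge solution is easy to verify numerically. A minimal Python sketch (simulated data; D leaves the intercept unpenalized) shows that λ = 0 recovers the OLS and that the penalized coefficients shrink as λ grows:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.5, 0.25]) + rng.normal(size=n)

def ridge(X, y, lam):
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0                              # do not penalize the intercept
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

beta_ols = ridge(X, y, 0.0)                    # lambda = 0 gives the OLS solution
beta_r = ridge(X, y, 10.0)

# The penalized coefficients are shrunk toward zero
assert np.sum(beta_r[1:] ** 2) < np.sum(beta_ols[1:] ** 2)
```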
The MLR defined in (3.1), but now expressed with the standardized variables, is

y = 1_n μ + X_{1s} β_{0s} + ε,

where 1_n is the column vector with 1's in all its entries and X_{1s} is the n × p matrix of centered and scaled features, with entries x_{ijs} = (x_{ij} − x̄_j)/s_j, where x̄_j and s_j are the sample mean and standard deviation of feature j. Then, the redefined penalized residual sum of squares under this model is

PRSS_λ(μ, β_{0s}) = (y − 1_n μ − X_{1s} β_{0s})^T (y − 1_n μ − X_{1s} β_{0s}) + λ β_{0s}^T β_{0s}.

The Ridge solution under this redefinition is like the one given before,

β̂^R_{0s}(λ) = (X_{1s}^T X_{1s} + λI_p)^{−1} X_{1s}^T y,

but now the expected value of the Ridge estimator of β_{0s} is (X_{1s}^T X_{1s} + λI_p)^{−1} X_{1s}^T X_{1s} β_{0s}, and its variance-covariance matrix is σ^2 (X_{1s}^T X_{1s} + λI_p)^{−1} X_{1s}^T X_{1s} (X_{1s}^T X_{1s} + λI_p)^{−1}. Because in this standardized form the Ridge solution for the intercept (μ) is the sample mean of the observed responses, and its correlation with the rest of the estimated parameters (β̂_{0s}(λ)) is null, in the literature it is common to handle this parameter separately from all the other coefficients (β_{0s}) (Christensen 2011).
Note that X_{1s}^T X_{1s} = Γ_s Λ_s Γ_s^T is the spectral decomposition of X_{1s}^T X_{1s}, and let β*_{0s} = Γ_s^T β_{0s}. The conditional expected prediction error at x_o when using the Ridge solution can then be decomposed as

EPE_λ(x_o) = σ^2 + [bias(x*_o^T β̂^R_s(λ))]^2 + Var(x*_o^T β̂^R_s(λ)).

The second and third terms of the last equality correspond to the squared bias and the variance of x*_o^T β̂^R_s(λ) as an estimator of x*_o^T β_s, respectively. By setting λ = 0, this EPE corresponds to the EPE of the OLS prediction but with standardized variables, while by letting λ be very large, the variance will decrease and the squared bias will increase.
More importantly, because the derivative of EPE_λ(x_o) with respect to λ, dEPE_λ(x_o)/dλ, is a right-continuous function at λ = 0, and for X_{1s} of full column rank

lim_{λ→0+} dEPE_λ(x_o)/dλ < 0,

there is a value λ* > 0 such that, at least in the interval [0, λ*], the expected prediction error at x_o shows a decreasing behavior. This indicates that there is a value of λ such that, with the Ridge regression estimation of the beta coefficients, we can get a smaller prediction error than with the OLS prediction. When X_{1s} is not of full column rank, the previous argument regarding the behavior of the EPE of the Ridge solution is no longer directly valid, but it can be used for validating part of the more general case. To see this, first note that the spectral decomposition of X_{1s}^T X_{1s} can be partitioned as

X_{1s}^T X_{1s} = [Γ_{1s} Γ_{2s}] diag(Λ_{1s}, Λ_{2s}) [Γ_{1s} Γ_{2s}]^T,

where Λ_{1s} = diag(λ_1, ..., λ_{p*}), p* = rank(X_{1s}) is the rank of the design matrix, and Λ_{2s} is the null matrix of order (p − p*) × (p − p*). Furthermore, because Γ_s^T Γ_s = I_p implies that Γ_{2s}^T X_{1s}^T X_{1s} Γ_{2s} = 0, which in turn implies that X_{1s} Γ_{2s} = 0, the MLR can be conveniently expressed in terms of X_{1s} Γ_{1s} only, and from similar arguments, the penalized residual sum of squares of the Ridge problem can be expressed in terms of Γ_{1s}^T β_{0s} only. Then, in a similar fashion as before, the conditional expected prediction error at x_o when using the Ridge solution can be computed by writing x_o = Γ_{1s} a_1 + Γ_{2s} a_2, with a_1 ∈ ℝ^{p*} and a_2 ∈ ℝ^{p−p*}. Using a similar argument as before, in the first case (x_o = Γ_{1s} a_1), there is a value of λ > 0 such that the expected prediction error at x_o is smaller than that obtained with the OLS approach, β̂_s(0) = lim_{λ→0+} β̂^R_s(λ). In the second case (x_o = Γ_{2s} a_2), the EPE(x_o) under both approaches is the same and does not depend on λ, so in such cases no gain is achieved with the Ridge solution. A third case remains, namely when the target feature is of the general form x_o = Γ_{1s} a_1 + Γ_{2s} a_2.
In this case, under the described argument, the advantage of Ridge regression over the OLS approach in a prediction context is not clear.
However, in practice we don't know the true values of the parameters, and we would need to evaluate the test error over all possible values of the training sample, which we also don't have. So a common way to choose the λ value is by cross-validation (CV). For more details about validation strategies, see Chap. 4. For example, with a K-fold CV, the complete data set is divided into K balanced disjoint subsets, S_k, k = 1, ..., K. One subset is used for validation and the rest are used to fit the model for each value of a chosen grid of values of λ. This procedure is repeated K times, where each time a different subset of the partition is taken as the validation set. A more detailed K-fold CV procedure is described below:

1. First, choose a grid of values of λ, λ = (λ_1, ..., λ_L).
2. Remove the subset S_k and, for each value λ_l in the grid, fit the model with the remaining K − 1 elements of the partition; denote by β̂^R_{−k}(λ_l) the corresponding Ridge estimate of β, and compute the average prediction error across all observations in the validation set S_k as

   MSE_k(λ_l) = (1/|S_k|) Σ_{i∈S_k} (y_i − x*_i^T β̂^R_{−k}(λ_l))^2,

   where |S_k| denotes the total number of observations in partition k.
3. Choose as the best value of λ in the grid, λ̃*, the one with the lowest average prediction error across all partitions, that is,

   λ̃* = argmin_{λ_l} (1/K) Σ_{k=1}^K MSE_k(λ_l).

It is important to point out that very often the performance of the model needs to be evaluated for comparison with other competing models. A common way to do this is to split the data set several times into two subsets, one for training the model (D_tr) (to fit the model) and the other for testing it (D_tst), on which the predictive ability of the model is assessed. In each splitting, only the training data set (D_tr) is used to train the model (by applying steps 1-3 above and then fitting the whole training data set), and the prediction evaluation of the fitted model is made with the testing data set. The prediction evaluation on the testing data set is done with an empirical "estimate" of the EPE,

MSE = (1/|D_tst|) Σ_{i∈D_tst} (y_i − ŷ_i)^2,

and finally an average evaluation of the performance of the model is obtained across all chosen splittings. See Chap. 4 for more explicit details.

The Ridge solution can also be obtained from a Bayesian formulation. To do this, consider the MLR model described before with standardized features and the vector of residuals distributed as N_n(0, σ^2 I_n). With this assumption, the vector of responses y is distributed according to a multivariate normal distribution with mean vector 1_n μ + X_{1s} β_{0s} and variance-covariance matrix σ^2 I_n. Then, to complete the Bayesian formulation, assume β_{0s} ~ N_p(0, σ^2_β I_p) as the prior distribution of the beta coefficients in β_{0s} and a "flat" prior for the intercept μ, where σ^2 and σ^2_β are known. Under this Bayesian specification, the posterior distribution of β_s is multivariate normal with variance-covariance matrix Σ̃_β = σ^2 (X_s^T X_s + λD)^{−1} and mean vector σ^{−2} Σ̃_β X_s^T y, where D is the diagonal penalty matrix.
That is, the posterior distribution of β_s is a multivariate normal distribution with mean vector

β̃_s = σ^{−2} Σ̃_β X_s^T y = (X_s^T X_s + λD)^{−1} X_s^T y.

Then, by taking λ = σ^2/σ^2_β, we have that the mean/mode of the posterior distribution of β_s coincides with the Ridge estimate described before, β̃^R(λ).
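The K-fold CV selection of λ in steps 1-3 above can be sketched as follows (Python, simulated data and a hypothetical grid of λ values):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)
Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept

def ridge(Xd, y, lam):
    D = np.eye(Xd.shape[1])
    D[0, 0] = 0.0
    return np.linalg.solve(Xd.T @ Xd + lam * D, Xd.T @ y)

K = 5
folds = np.array_split(rng.permutation(n), K)  # K balanced disjoint subsets S_k
grid = [0.01, 0.1, 1.0, 10.0, 100.0]           # step 1: grid of lambda values
cv_mse = []
for lam in grid:
    errs = []
    for k in range(K):                         # step 2: fit on K-1 folds,
        test_idx = folds[k]                    # validate on the held-out fold
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        b = ridge(Xd[train_idx], y[train_idx], lam)
        errs.append(np.mean((y[test_idx] - Xd[test_idx] @ b) ** 2))
    cv_mse.append(np.mean(errs))

lam_best = grid[int(np.argmin(cv_mse))]        # step 3: smallest average error
```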
Example 2. We considered a genomic example to illustrate the Ridge regression approach and the CV process to choose the tuning parameter λ (WheatMadaToy data, with PH as the response). This data set consists of 50 observations corresponding to 50 lines and a genomic relationship matrix computed from marker information. Table 3.1 shows the prediction performance of the Ridge and OLS approaches in terms of the MSE, across five different splittings obtained by partitioning the complete data set into five subsets: the data of one subset are used as the testing set and the rest to train the model. For training the model, a five-fold cross-validation (5FCV) was used following steps 1-3 described before, and the prediction evaluation was done as described above. Table 3.1 indicates that in four out of five partitions, Ridge regression shows a smaller MSE than the corresponding OLS approach. In these cases, the MSE of the OLS was, on average, 31.46% greater than that of the Ridge regression approach, and in general, on average, by 31.14% (MSE = 421.8834 for Ridge and MSE = 655.8596 for OLS). From this, we have that the Ridge regression approach shows better prediction performance than the OLS. The large variation of the MSE between folds observed in this example could indicate that a larger number of partitions is needed to obtain a more precise comparison between models. Often, the use of more partitions is avoided when larger data sets are used in applications.
From the last code, three things are important to point out:

1. Grpv contains the information of the K = 5 folds for the outer CV implemented; each time the model is trained with K − 1 folds and tested with the remaining fold.
2. The function that trains the model under Ridge regression is called Tr_RR_f, while the function that obtains the predictions for the testing set from this trained model is Pred_RR_f; both functions are fully described in Appendix 1.
3. The predictions under the OLS method are obtained with the function Pred_ols_f, which is also fully detailed in Appendix 1. It is important to point out that the function Tr_RR_f internally implements an inner k-fold CV to tune the hyperparameter λ required in Ridge regression.

Example 3 (Simulation)
To get a better idea of the behavior of the Ridge solution, here we report the results of a small simulation study in a scenario where the number of observations (n = 100) is less than the number of features (p = 500) and the features are moderately correlated. Specifically, we generated 100 data sets, each of size 100, from the following model:

y_i = x_i^T β_0 + ε_i,  i = 1, ..., n,

where the vector of beta coefficients (β_0) was set to the values shown in Fig. 3.4, and the features of all the individuals in each data set were generated from a multivariate normal distribution centered on the null vector with variance-covariance matrix Σ = 0.25I_p + 0.75J_p, where I_p and J_p are the identity matrix and the matrix of ones of dimension p × p, respectively. The random errors (ε_i) were simulated from a normal distribution with mean 0 and variance 0.025.
The behavior of the Ridge and OLS solutions across the 100 simulated data sets is shown in Fig. 3.5. The MSE of Ridge regression is located on the x-axis and the corresponding MSE of the OLS is located on the y-axis. On average, the OLS resulted in an MSE equal to 808.81, which is 30.59% larger than the average MSE (619.32) of the Ridge approach. In terms of the percentage of simulations in favor of each method, Ridge regression was better in 78 out of 100 simulations, while the OLS was better only in 22 out of 100 simulations. In general, from this small simulation study we obtained more evidence in favor of the Ridge regression method.
The R code used for obtaining these results is given next; the TR_RR.R script file is the same as the one defined in Example 2 in Appendix 1.

Lasso Regression
Like Ridge regression, Lasso regression solves the OLS problem but penalizes the residual sum of squares in a slightly different way. With the standardized variables, the Lasso estimator of β_s is defined as the minimizer of

PRSS_λ(β_s) = (y − 1_n μ − X_{1s} β_{0s})^T (y − 1_n μ − X_{1s} β_{0s}) + λ Σ_{j=1}^p |β_js|,

that is, penalized by the sum of the absolute values of the regression coefficients. For λ = 0, the solution is the OLS, while when λ is large, the OLS solutions are shrunken toward 0 (Tibshirani 1996). Note that for any given values of β_{0s}, the value of μ that minimizes PRSS_λ(β_s) is the sample mean of the responses, μ̃ = (1/n) Σ_{i=1}^n y_i, the same as for the Ridge estimator. However, the rest of the Lasso estimator of β_s, that is β_{0s}, cannot be obtained analytically, so numerical methods are often used.
Although there are efficient algorithms for computing the entire regularization path of the Lasso regression coefficients (Efron et al. 2004; Friedman et al. 2008), here we will describe the coordinate-wise descent algorithm given in Friedman et al. (2007). The idea of this method is to successively optimize PRSS_λ(β_s) one parameter (beta coefficient) at a time. Holding β_js, j ≠ k, fixed at their current values β̃_js(λ), and assuming the standardized features satisfy Σ_i x_iks^2 = n, the value of β_ks that minimizes PRSS_λ(β_s) is given by

β̃_ks(λ) = (1/n) S(Σ_{i=1}^n x_iks r_i^{(k)}, λ/2),

where r_i^{(k)} = y_i − μ̃ − Σ_{j≠k} x_ijs β̃_js(λ) is the partial residual that excludes the contribution of feature k, and S(z, γ) = sign(z)(|z| − γ)_+ is the soft-thresholding operator. To obtain the Lasso estimate of β_{0s}, this process is repeated across all the coefficients until a convergence threshold criterion is reached. This algorithm can be implemented with the glmnet R package (Friedman et al. 2010) as part of a more general penalized regression (elastic net), which is defined as a combination of the Ridge and Lasso penalties. Due to the structure of the algorithm, it can be used on very large data sets and can benefit from sparsity in the explanatory variables (Friedman et al. 2008).
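A minimal Python sketch of this coordinate-wise descent (simulated standardized data; for simplicity it uses the glmnet-style scaling (1/(2n))RSS + λΣ|β_j| of the objective, under which the soft threshold is λ itself) is:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 80, 10
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardized features
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]               # sparse true coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    mu = y.mean()                              # intercept = sample response mean
    r = y - mu                                 # current residual
    beta = np.zeros(p)
    for _ in range(n_iter):
        for k in range(p):
            r = r + X[:, k] * beta[k]          # partial residual without x_k
            beta[k] = soft_threshold(X[:, k] @ r / n, lam)
            r = r - X[:, k] * beta[k]
    return mu, beta

mu_hat, beta_hat = lasso_cd(X, y, lam=0.3)
# A sufficiently large penalty zeroes out some coefficients (sparsity)
assert np.sum(beta_hat == 0.0) > 0
```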
Equivalently, the Lasso estimator of the beta coefficients β_{0s} can be defined as the solution of the OLS problem subject to the constraint Σ_{j=1}^p |β_js| ≤ t. With this, a graphic representation of the Lasso estimator is similar to that of the Ridge (see Fig. 3.6). The nested ellipsoids correspond to contour plots of RSS(β) and the green region is the restriction with t = 3/2, which contains the Lasso solution. Indeed, the Lasso solution is the first point at which the contours touch the square region, and this will sometimes be a corner, which makes some coefficients exactly zero. Because there are no corners in Ridge regression, this will rarely happen (Tibshirani 1996).

The Lasso estimator can also be derived from a Bayesian perspective. Supposing that the vector of residuals is distributed as N_n(0, σ^2 I_n), as in the Ridge regression case, and assuming that the priors of the elements of β_{0s} are independent and identically distributed according to a Laplace distribution with mean 0 and variance σ^2_β, and adopting a "flat" prior for μ, with known σ^2 and σ^2_β, the posterior distribution of β_s is proportional to the product of the Gaussian likelihood and these priors. The mode of this posterior distribution of β_s corresponds to the Lasso estimator described before, β̃^L(λ). The performance of Lasso regression in terms of prediction error is sometimes comparable to Ridge regression (Hastie et al. 2009). However, as we pointed out before, and based on the nature of the restriction term, for any given value of t, only a subset of the coefficients β_js is nonzero, so this gives a sparse solution (Efron et al. 2004).
Example 4. To illustrate Lasso regression, here we considered the data used in Example 2, but instead of using a five-fold cross-validation (5FCV) to explore its behavior, we built 100 random splittings of the complete data set: 80% for training and 20% for testing. Figure 3.7 plots the MSE of the Lasso regression (y-axis) against the MSE corresponding to Ridge regression (x-axis). In 81 out of 100 random splittings, the Ridge regression approach gave a better performance, and in these cases, on average, the Lasso regression showed an MSE 92.13% greater than that of the Ridge solution. In the other cases, the Ridge was worse, on average, by 30.91%. On average across all the splittings, the performance of the Ridge regression (mean MSE = 118.9726, standard deviation = 50.7193) was superior to the Lasso (mean MSE = 200.6021, standard deviation = 222.5494) by 68.61%, but the Lasso was better than the OLS solution (mean MSE = 1609.4635, standard deviation = 1105.4434) by 802.32%, while the Ridge was 1352.80% better than the OLS estimate.
It is important to point out that Lasso regression performs particularly well when there is a subset of true coefficients that are small or even zero. It does not do as well when all of the true coefficients are moderately large; however, in this case, it can still outperform linear regression over a fairly narrow range of (small) λ values.

Logistic Regression
Logistic regression is a useful and traditional tool for explaining or predicting a binary response based on the information of explanatory variables. It models the conditional distribution of the response variable as a Bernoulli distribution with probability of success given by

$$\pi(x_i) = P(y_i = 1 \mid x_i) = \frac{\exp\left(\beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j\right)}{1 + \exp\left(\beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j\right)}.$$

To estimate the parameters of the logistic regression, suppose that we have a set of data $(x_i^T, y_i)$, $i = 1, \ldots, n$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ is a vector of feature measurements and $y_i$ is the response measurement corresponding to the ith drawn individual. To obtain the MLE of β, first we need to build the likelihood function of the parameters. This is given by

$$L(\beta; y) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\left(1 - \pi(x_i)\right)^{1 - y_i},$$

and from here the log-likelihood is

$$\ell(\beta; y) = \sum_{i=1}^{n} \left[ y_i \log \pi(x_i) + (1 - y_i) \log\left(1 - \pi(x_i)\right) \right].$$

Then the gradient of the log-likelihood is given by

$$\nabla \ell(\beta; y) = X^T (y - \pi),$$

where $\pi = (\pi(x_1), \ldots, \pi(x_n))^T$ and X is the design matrix defined before (with the intercept in the first column). Setting this gradient to zero yields equations that are nonlinear in β, so they must be solved iteratively. Once the MLE is obtained, a new observation is usually classified as a success when its estimated probability of success exceeds 0.5. Of course, a threshold different from 0.5 can be used, and this can be treated as a hyperparameter that needs to be tuned in a similar fashion as the penalty parameter in the Ridge procedure. It is important to point out that the maximization of the log-likelihood can be performed using a more efficient iterative technique called the Newton-Raphson method, which is an iterative optimization technique that uses a local quadratic approximation to the log-likelihood function. The following is the Newton-Raphson iterative equation used to search for the beta coefficients:

$$\beta^{(t+1)} = \beta^{(t)} - H^{-1} \nabla \ell(\beta^{(t)}; y),$$

where $H = -X^T W X$ is the Hessian matrix whose elements comprise the second derivatives of the log-likelihood with respect to the beta coefficients, and W is a diagonal matrix with entries $\pi(x_i)(1 - \pi(x_i))$. Therefore, the inverse of the Hessian is $H^{-1} = -(X^T W X)^{-1}$, so the previous equation can be expressed as

$$\beta^{(t+1)} = \beta^{(t)} + (X^T W X)^{-1} X^T (y - \pi).$$

It is important to recall that the Hessian is no longer constant, since it depends on β through the weighting matrix W. It is also clear that logistic regression does not have a closed-form solution because of the nonlinearity of the logistic sigmoid function.
It is important to point out that if instead of maximizing the likelihood we minimize the negative of the log-likelihood, the Newton-Raphson equation for updating the beta coefficients is equal to

$$\beta^{(t+1)} = (X^T W X)^{-1} X^T W z, \quad \text{with} \quad z = X\beta^{(t)} + W^{-1}(y - \pi).$$

This Newton-Raphson algorithm for logistic regression is known as iteratively reweighted least squares, since each step solves a weighted least squares problem in which the diagonal weighting matrix W is interpreted as a matrix of variances. Another alternative method for estimating the beta coefficients in logistic regression is the Fisher scoring method, which is very similar to the Newton-Raphson method just described, with the difference that instead of using the Hessian (H), it uses the expected value of the Hessian matrix, E(H).
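The iteratively reweighted least squares scheme just described can be implemented in a few lines. This is a minimal sketch with simulated data (the data, the convergence settings, and the function name `irls_logistic` are all assumptions of the illustration); X is assumed to already contain an intercept column, matching the design matrix convention used in this chapter:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares.

    Each step solves the weighted least squares problem
        beta_new = (X^T W X)^{-1} X^T W z,
    with working response z = X beta + W^{-1}(y - pi)
    and weights W = diag(pi_i * (1 - pi_i)).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        w = pi * (1.0 - pi)
        z = eta + (y - pi) / w
        # Solve (X^T W X) beta_new = X^T W z without forming W explicitly
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated example with an intercept column in X
rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.5, -1.0])
pi = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, pi)

beta_hat = irls_logistic(X, y)
print("IRLS estimates:", np.round(beta_hat, 2))
```

With a moderate sample size, the recovered coefficients land close to the true values, and each iteration is exactly one weighted least squares solve.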

Logistic Ridge Regression
Like the MLR, when there is strong collinearity, the variance of the MLE is severely affected and the true effects of the explanatory variables could be falsely identified (Lee and Silvapulle 1988). In a similar fashion as for the MLR, this can be judged directly from the asymptotic covariance matrix of $\hat{\beta}$. Moreover, in a common prediction context, when the number of features is larger than the number of observations (p > n), the design matrix is not of full column rank, which can cause overfitting and affect the expected classification error (generalization error) of the MLE. One way to avoid overfitting is to replace the MLE with a regularized MLE, analogous to the Ridge estimator of the MLR. This is defined as

$$\tilde{\beta}_s^{R}(\lambda) = \underset{\beta_s}{\operatorname{argmax}} \left\{ \ell(\beta_s; y) - \frac{\lambda}{2} \sum_{j=1}^{p} \beta_{js}^2 \right\},$$

where λ is a hyperparameter with a similar interpretation as in the MLR. In the literature, there are some algorithms that approximate the Ridge estimate. For example, Genkin et al. (2007) used a cyclic coordinate descent optimization algorithm to approximate it, in which the one-dimensional optimization problems involved are solved by a modified Newton-Raphson method. Another method was proposed by Friedman et al. (2008) in a more general context. Given the current values $\tilde{\beta}_s(\lambda)$, the next update of coordinate $\beta_{ks}$, for $k = 1, \ldots, p$, is given by

$$\tilde{\beta}_{ks}(\lambda) = \frac{\sum_{i=1}^{n} w_i x_{iks} \left( y_i^{*} - \tilde{\mu} - \sum_{j \ne k} x_{ijs}\, \tilde{\beta}_{js}(\lambda) \right)}{\sum_{i=1}^{n} w_i x_{iks}^2 + \lambda},$$

and that of μ is given by

$$\tilde{\mu} = \frac{\sum_{i=1}^{n} w_i \left( y_i^{*} - \sum_{j=1}^{p} x_{ijs}\, \tilde{\beta}_{js}(\lambda) \right)}{\sum_{i=1}^{n} w_i},$$

where $y_i^{*}$ and $w_i$, $i = 1, \ldots, n$, are pseudo responses and weights that change across the updates, with $w_i = \tilde{\pi}(x_i)(1 - \tilde{\pi}(x_i))$ and $y_i^{*}$ the working response of the quadratic approximation. These updates can be obtained by maximizing, with respect to $\beta_{ks}$, the following quadratic approximation of the penalized likelihood at the current values $\tilde{\beta}_s(\lambda)$:

$$\ell^{*}(\beta_s; y) - \frac{\lambda}{2} \sum_{j=1}^{p} \beta_{js}^2 + c,$$

where $\ell^{*}(\beta_s; y) = -\frac{1}{2} \sum_{i=1}^{n} w_i \left( y_i^{*} - \mu - \sum_{j=1}^{p} x_{ijs}\beta_{js} \right)^2$ is the quadratic approximation of $\ell(\beta_s; y)$ at the current values of $\beta_s$, $\tilde{\beta}_s(\lambda)$, and c is a constant that does not depend on $\beta_s$. More details of this implementation (glmnet R package) can be found in Friedman et al. (2008). Note that under this approximation, the beta coefficients can be updated in matrix form by

$$\tilde{\beta}_s(\lambda) = \left( X^T W X + \lambda D \right)^{-1} X^T W y^{*},$$

where $W = \mathrm{Diag}(w_1, \ldots, w_n)$ and D is the diagonal matrix as defined before.
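A full penalized update of this form can be sketched as a penalized variant of iteratively reweighted least squares. This is an illustration, not the glmnet coordinate descent algorithm itself: the simulated p > n data, the penalty value, and the convention of leaving the intercept unpenalized (D with a zero in its first diagonal entry) are all assumptions of the sketch:

```python
import numpy as np

def ridge_logistic(X, y, lam, n_iter=50):
    """Ridge-penalized logistic regression via penalized IRLS.

    Update: beta = (X^T W X + lam * D)^{-1} X^T W z, where D is the
    identity matrix with a zero first diagonal entry so that the
    intercept is left unpenalized (an assumption of this sketch).
    """
    n, p = X.shape
    D = np.eye(p)
    D[0, 0] = 0.0                         # do not penalize the intercept
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(pi * (1.0 - pi), 1e-10, None)   # guard against w = 0
        z = eta + (y - pi) / w                      # working response
        beta = np.linalg.solve(X.T @ (w[:, None] * X) + lam * D,
                               X.T @ (w * z))
    return beta

# p > n scenario: the unpenalized MLE is not unique, but the ridge fit is stable
rng = np.random.default_rng(4)
n, p = 40, 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.binomial(1, 0.5, size=n)

beta_hat = ridge_logistic(X, y, lam=5.0)
print("largest |coefficient|:", np.abs(beta_hat).max())
```

Even though the design matrix has more columns than rows, the term λD makes the system positive definite, so the update remains well defined and the coefficient estimates stay bounded.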

Lasso Logistic Regression
The Lasso penalization can also be applied to other models (Tibshirani 1996). In particular, for logistic regression, the Lasso estimator of $\beta_s$ is defined as

$$\tilde{\beta}_s^{L}(\lambda) = \underset{\beta_s}{\operatorname{argmax}}\; \ell_L(\beta_s; y),$$

where $\ell_L(\beta_s; y) = \ell(\beta_s; y) - \lambda \sum_{j=1}^{p} |\beta_{js}|$ is often known as the regularized Lasso likelihood. Numerical methods are also required to obtain this Lasso estimate. There are several possibilities (Genkin et al. 2007), such as non-quadratic programming and iteratively reweighted least squares (Tibshirani 1996), but here we briefly describe the one proposed by Friedman et al. (2008) and implemented in the glmnet R package. This method consists of applying the coordinate descent procedure to a penalized reweighted least squares problem, which is formed by making a Taylor approximation to the likelihood around the current values of the coefficients. That is, this procedure consists of successively updating the parameters by

$$\tilde{\beta}_s = \underset{\beta_s}{\operatorname{argmin}} \left\{ \frac{1}{2} \sum_{i=1}^{n} w_i \left( y_i^{*} - \mu - \sum_{j=1}^{p} x_{ijs}\beta_{js} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_{js}| \right\},$$

with $w_i$ and $y_i^{*}$, $i = 1, \ldots, n$, as defined in the Ridge regression case. More details of this implementation can be consulted in Friedman et al. (2008).
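The sparsity induced by the L1 penalty in the logistic setting can be illustrated with scikit-learn's L1-penalized logistic regression (a stand-in for the glmnet workflow described above; the simulated data and the penalty strength are assumptions of the illustration, and scikit-learn's `C` parameter is the inverse of the penalty strength λ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -2.0, 1.5]          # only three informative features
pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
y = rng.binomial(1, pi)

# C is the inverse of the penalty strength lambda; small C = strong penalty
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.2).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.2).fit(X, y)

print("coefficients set to zero, L1:", np.sum(l1.coef_ == 0))
print("coefficients set to zero, L2:", np.sum(l2.coef_ == 0))
```

As in the linear case, the L1 fit drops most of the uninformative features, whereas the L2 (Ridge) fit shrinks them toward zero without eliminating any.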
Example 5 In this example, we used data corresponding to 40 lines planted with four repetitions. For illustrative purposes, we use as response a binary variable based on plant height. The matrix of features used here was obtained from the genomic relationship matrix (G), $X = Z_L G^{1/2}$, where $G^{1/2}$ is the square root matrix of G.
The performance of logistic regression, logistic Ridge regression, and Lasso logistic regression for this data set was evaluated across 100 random splittings of the complete data set: 20% for testing (performance evaluation) and 80% for training. The performance was measured by the proportion of cases correctly classified (PCCC) in the testing data. These results are summarized in Table 3.2, where for each method the mean and standard deviation (SD) of the PCCC across the 100 splittings are reported. The table indicates that, on average, the standard logistic regression (SLR) approach shows slightly better performance than the other two approaches, even better than the Lasso solution. Out of the 100 random partitions, the SLR, the logistic Ridge regression (LRR), and the Lasso logistic regression (LLR) resulted in the highest PCCC value in 72, 24, and 4 cases, respectively. However, the differences in the performance of the three methods are not significant because of the large variability observed across the different partitions.
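The evaluation scheme of Example 5 can be sketched as follows. Since the plant height data set is not reproduced here, this illustration uses simulated binary data (an assumption), but the 100 random 80/20 splittings, the three competing classifiers, and the PCCC summary follow the protocol above; a near-unpenalized fit (very large `C`) stands in for the standard logistic MLE:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n, p = 160, 40
X = rng.normal(size=(n, p))
beta = rng.normal(size=p) * 0.4
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta))))

models = {
    "SLR": LogisticRegression(C=1e6, max_iter=5000),                # near-unpenalized = standard MLE
    "LRR": LogisticRegression(penalty="l2", C=0.5, max_iter=5000),  # Ridge-penalized
    "LLR": LogisticRegression(penalty="l1", solver="liblinear", C=0.5),  # Lasso-penalized
}

pccc = {name: [] for name in models}
for split in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=split)
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        # PCCC = proportion of testing cases correctly classified
        pccc[name].append(np.mean(model.predict(X_te) == y_te))

for name, vals in pccc.items():
    print(f"{name}: mean PCCC = {np.mean(vals):.3f}, SD = {np.std(vals):.3f}")
```

Reporting both the mean and the SD of the PCCC, as in Table 3.2, makes it clear whether observed differences among the methods exceed the splitting-to-splitting variability.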