Abstract
Linear regression models are widely used in statistics, machine learning and system identification. They can address many important problems, are easy to fit and enjoy simple analytical properties. The simplest method to fit linear regression models is least squares, whose systematic treatment is available in many textbooks, e.g., [35, Chap. 4], [12]. Linear regression models can also be fitted in different ways, and the class of methods that we will consider in this chapter is the so-called regularized least squares. It is an extension of least squares which minimizes the sum of the square loss function and a regularization term. The latter can take various forms, leading to several variants which have been applied extensively in theory as well as in practical applications. In this chapter, we will focus on these methods and introduce their fundamentals. In the first part of the appendix to this chapter, we also report some basic results of linear algebra useful for the reading.
3.1 Linear Regression
Regression theory is concerned with modelling relationships among variables. It is used for predicting one dependent variable based on the information provided by one or more independent variables. In linear regression, the relationship among variables is given by linear functions. To illustrate this, we start from the function estimation problem because it is intuitive and easy to understand.
The aim of function estimation is to reconstruct a function \(g:{\mathbb R}^n\rightarrow {\mathbb R}\) with \(n\in \mathbb N\) from a collection of N measured values of g(x) and x which we denote, respectively, by \(y_i\) and \(x_i\) for \(i=1,\ldots , N\). For generic values of x, the estimate \(\hat{g}\) should give a good prediction \(\hat{g}(x)\) of g(x). The variables x and g(x) are often called the input and the output variable or simply the input and the output, respectively. The collection of measured values of x and g(x), given by the pairs \(\{x_i,y_i\}\), is called the data set or training set. In practical applications, the measurement \(y_i\) is often not precise and subject to some disturbance, i.e., for a given input \(x_i\) there is often a discrepancy between \(g(x_i)\) and its measured value \(y_i\). To describe this phenomenon, it is natural to introduce a disturbance variable \(e\in {\mathbb R}\) and assume that, for any given \(x\in {\mathbb R}^n\), the measured value of g(x) is
Hence, y is the measured output and g(x) is the noise-free or true output. Accordingly, the training data \(\{x_i,y_i\}_{i=1}^N\) are collected as follows:
We are interested in linear regression models for estimation of g. For illustration, an example is now introduced.
Example 3.1
(Polynomial regression) We consider \(g:[0,1]\rightarrow {\mathbb R}\) and assume that such function is smooth. Then, g can be well approximated by polynomials of a suitable order. In this case, a linear regression model for the function estimation problem takes the following form:
where \(\theta _k\in {\mathbb R}\) for \(k=1,\ldots ,n\). Defining
where, for a real-valued matrix A, the notation \(A^T\) denotes its matrix transpose, we rewrite (3.3) as
obtaining a more compact expression. \(\square \)
Although (3.5) is derived from Example 3.1, it is the general linear regression model studied in the theory of regression. For convenience, when the context is clear, we drop the explicit dependence of \(\phi (x_i)\) on \(x_i\) and simply write \(\phi _i\). In addition, all vectors are column vectors. Then, model (3.5) becomes
In what follows, we will focus on (3.6) and introduce the linear regression problem, the methods of least squares and regularized least squares. We will call \(y_i\in {\mathbb R}\) the measured output, \(\phi _i\in {\mathbb R}^n\) the regressor, \(\theta \in {\mathbb R}^n\) the model parameter, n the model order, and \(e_i\) the measurement noise.
Before proceeding, it should be noted that the choice of the model order n is a critical problem in practical applications. The rule of thumb is to set n to a large enough value such that g can be represented by the proposed model structure. In system identification, this corresponds to introducing a model structure flexible enough to contain the true system. Consider, e.g., Example 3.1 again and assume that the function g is actually a polynomial of order 5. Clearly, if the dimension of \(\theta \) does not satisfy \(n \ge 6\), then \(x^5\) cannot be represented and some model bias will affect the estimation process. However, the order n should not be chosen larger than necessary, because this can increase the variance of the model estimate. This problem is actually the same as model complexity selection in classical system identification and is connected with the bias-variance trade-off illustrated in the first two chapters and also discussed in more detail shortly.
Also in light of the above discussion, we often assume that the model order n is either large enough for g to be adequately represented by the proposed model or even that a true model parameter that has generated the data exists, denoted by \(\theta _0\in {\mathbb R}^n\). Hence, we can formulate linear regression as the problem of obtaining an estimate \(\hat{\theta }\) such that, given a new regressor \(\phi \in {\mathbb R}^n\), the prediction \(\phi ^T \hat{\theta }\) is close to \(\phi ^T \theta _0\).
3.2 The Least Squares Method
There are many methods to estimate \(\theta \) in the linear regression model (3.6). In this section, we consider the least squares (LS) method.
3.2.1 Fundamentals of the Least Squares Method
Given the data \(y_i,\phi _i\) for \(i=1,\ldots ,N\), one way to estimate \(\theta \) is to minimize the least squares (LS) criterion:
where \(l(\theta )\) is the LS criterion and \(\hat{\theta }^{\text {LS}}\) is the LS estimate of \(\theta \). Then, the predicted output \(\hat{y}\) of the noise-free output \(\phi ^T \theta _0\), for a given \(\phi \in {\mathbb R}^n\), is obtained as
3.2.1.1 Normal Equations and LS Estimate
The LS estimate \(\hat{ \theta }^{\text {LS}}\) given by (3.7) has a closed-form expression. To see this, note that the first- and second-order derivatives of \(l(\theta )\) with respect to \(\theta \) are
where \(A\succcurlyeq 0\) means that A is a positive semidefinite matrix. Then all \(\hat{\theta }^{\text {LS}}\) that satisfy
are global minima of \(l(\theta )\). The set of Eqs. (3.10) is known as the normal equations. For the time being, we assume that \(\sum _{i=1}^N \phi _i\phi _i^T\) is full rank.^{Footnote 1} Then
3.2.1.2 Matrix Formulation
It is often convenient to rewrite the LS method in matrix form. To this goal, let
We can then rewrite (3.6) with the \(\theta _0\) that generated the data, the LS criterion (3.7), the normal Eqs. (3.10) and the LS estimate (3.11) in matrix form, respectively:
where \(\Vert \cdot \Vert _2\) is the Euclidean norm, i.e., the 2-norm, and \(\varPhi \) is called the regression matrix.
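To make the matrix formulation concrete, the following sketch (all data and dimensions here are ours, purely illustrative) computes the LS estimate both through the normal equations and through a numerically robust least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N = 50 samples, n = 3 regressors.
N, n = 50, 3
Phi = rng.standard_normal((N, n))       # regression matrix, rows are phi_i^T
theta0 = np.array([1.0, -2.0, 0.5])     # "true" parameter, used only to simulate Y
Y = Phi @ theta0 + 0.1 * rng.standard_normal(N)

# Normal equations: (Phi^T Phi) theta = Phi^T Y.
theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Numerically preferable route (QR/SVD based), giving the same estimate here.
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
```

For well-conditioned problems the two routes agree; the next sections discuss why the direct normal-equations route can fail when \(\varPhi \) is ill-conditioned.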
3.2.2 Mean Squared Error and Model Order Selection
3.2.2.1 Bias, Variance, and Mean Squared Error of the LS Estimate
We study the linear regression problem in a probabilistic framework, assuming that data are generated according to (3.13) and that
Due to this assumption, the LS estimator \(\hat{\theta }^{\text {LS}}\), as well as any estimator of \(\theta \) that depends on the data, becomes a random variable. It is then interesting to study the statistical properties of \(\hat{\theta }^{\text {LS}}\), such as the bias, variance and mean squared error (MSE).
All the expectations reported below are computed with respect to the noises \(e_i\) with the regressors \(\phi _i\) assumed to be deterministic. Simple calculations lead to
where \(\text {Cov}(\hat{\theta }^{\text {LS}},\hat{\theta }^{\text {LS}})\) is the covariance matrix of \(\hat{\theta }^{\text {LS}}\) and \(\text {MSE}(\hat{\theta }^{\text {LS}}, \theta _0)\) is the MSE matrix of \(\hat{\theta }^{\text {LS}}\) as a function of the true model parameter \(\theta _0\).
3.2.2.2 Model Order Selection
The issue of model order selection is essentially the same as that of model complexity selection in the classical system identification scenario. Therefore, the techniques introduced in Sect. 2.4.3 can be used to choose the model order n, e.g., Akaike’s information criterion (AIC) [1], the Bayesian information criterion (BIC) or the minimum description length (MDL) approach [25, 39].
The quality of the LS estimate \(\hat{\theta }^{\text {LS}}\) depends on the adopted model order n. In practical applications, the model complexity is in general unknown and needs to be determined from data. As the model order n gets larger, the fit to the data \(\Vert Y-\varPhi \hat{\theta }^{\text {LS}}\Vert _2^2\) in (3.14) becomes smaller, but the variances along the diagonal of the MSE matrix (3.18d) of \(\hat{\theta }^{\text {LS}}\) become larger at the same time. When assessing the quality of \(\hat{\theta }^{\text {LS}}\), one way to account for the increasing variance is to introduce criteria that suitably modify the plain data fit. AIC and BIC are techniques following this idea and can be used for model order selection. More specifically, besides (3.17), further assuming that the errors are independent and Gaussian, i.e.,
with known noise variance \(\sigma ^2\), we obtain
where the minimization also takes place over a family of model structures with different dimension n of \(\theta \).
Another way is to estimate the prediction capability of the model on unseen data which are not used for model estimation. As briefly seen in Sect. 2.6.3, cross-validation (CV) exploits this idea and is among the most widely used techniques for model selection. Recall that hold-out CV is the simplest form of CV, with data divided into two parts. One part is used to estimate the model with different model orders and the other part is used to assess the prediction capability of each model through the prediction score \(\Vert Y_{\text {v}}-\varPhi _{\text {v}}\hat{\theta }^{\text {LS}}\Vert _2^2\). Here, \(Y_\text {v},\varPhi _\text {v}\) are the validation data, which are different from those used to derive \(\hat{\theta }^{\text {LS}}\). The model order giving the best prediction score is chosen.
The noise variance \(\sigma ^2\) of the measurement noises \(e_i\) plays an important role in statistical modelling, e.g., in the assessment of the variance of \(\hat{\theta }^{\text {LS}}\) and in the model order selection using, e.g., AIC (3.20) or BIC (3.21). In practical applications, the noise variance \(\sigma ^2\) is in general unknown and needs to be estimated from the data Y and \(\varPhi \). It can be estimated in different ways, based on the maximum likelihood estimation (MLE) method or on the statistical properties of \(\hat{\theta }^{\text {LS}}\).
Under (3.17) and the Gaussian assumption (3.19), the ML estimate of \(\sigma ^2\), as given in [25, p. 506], is
Using only assumption (3.17), an unbiased estimator of \(\sigma ^2\), as given in [25, p. 554], turns out to be
AIC and BIC were reported, respectively, in (3.20) and (3.21) assuming known noise variance. When \(\sigma ^2\) is unknown, the use of the ML estimate (3.22) leads to the widely used AIC and BIC for Gaussian innovations, e.g., [25, pp. 506–507]:
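A small sketch may clarify how these quantities interact in practice. The data and the polynomial model below are our own illustration, and we assume the standard Gaussian-innovation forms \(\text {AIC}=N\log \hat{\sigma }^2_{\text {ML}}+2n\) and \(\text {BIC}=N\log \hat{\sigma }^2_{\text {ML}}+n\log N\) (constants dropped), which may differ from the book's exact (3.24)–(3.25) by additive constants:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data from a cubic polynomial plus noise (our own toy setup).
x = np.linspace(0.0, 1.0, 40)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 1.5 * x**3 + 0.05 * rng.standard_normal(x.size)
N = y.size

def rss(n):
    """Residual sum of squares of the LS fit with n polynomial coefficients."""
    Phi = np.vander(x, n, increasing=True)          # columns 1, x, ..., x^(n-1)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.sum((y - Phi @ theta) ** 2)

for n in range(1, 9):
    sig2_ml = rss(n) / N          # ML noise-variance estimate
    sig2_unb = rss(n) / (N - n)   # unbiased noise-variance estimate
    aic = N * np.log(sig2_ml) + 2 * n          # assumed AIC form, constants dropped
    bic = N * np.log(sig2_ml) + n * np.log(N)  # assumed BIC form, constants dropped
```

Note that the residual sum of squares can only decrease as n grows (the models are nested), which is exactly why the penalty terms 2n and n log N are needed.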
Example 3.2
(Polynomial regression using LS and discrete model order selection) We apply the LS method and the model order selection techniques to polynomial regression as sketched in Example 3.1. Let the function g be
Then, we generate the data as follows:
where \(x_1=0,x_{40}=1\), the \(x_2,\ldots ,x_{39}\) are evenly spaced points between \(x_1\) and \(x_{40}\), and the noises \(e_i\) are i.i.d. Gaussian distributed with zero mean and standard deviation 0.034. The function g and the generated data are shown in Fig. 3.1.
The function g is smooth and can be well approximated by polynomials. However, it is unclear which order should be chosen. Hence, we test the values \(n=1,\ldots ,15\) and, for each order n, we form the regressor (3.4) and the linear regression model (3.13) and derive the LS estimate \(\hat{\theta }^{\text {LS}}\). As shown in Fig. 3.2, as the order n increases the data fit \(\Vert Y-\varPhi \hat{\theta }^{\text {LS}}\Vert _2^2\) keeps decreasing.
For model order selection, we use AIC (3.24), BIC (3.25) and hold-out CV with \(x_i,y_i\), \(i=1,3,\ldots ,39\) for estimation and \(x_i,y_i\), \(i=2,4,\ldots ,40\) for validation. Figure 3.3 plots the values of AIC (3.24), BIC (3.25) and the prediction score of hold-out CV. The orders selected by AIC and BIC coincide and equal 3, while the order selected by hold-out CV is 7.
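The odd/even hold-out scheme used above can be sketched as follows. This is not the book's exact experiment: the function, noise level and seed here are our own placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical smooth function plus noise; the split mimics the odd/even scheme.
x = np.linspace(0.0, 1.0, 40)
y = np.sin(3.0 * x) + 0.03 * rng.standard_normal(x.size)

x_e, y_e = x[0::2], y[0::2]     # estimation half (x_1, x_3, ...)
x_v, y_v = x[1::2], y[1::2]     # validation half (x_2, x_4, ...)

def cv_score(n):
    """Fit an order-n polynomial model on the estimation half,
    score it on the validation half."""
    theta, *_ = np.linalg.lstsq(np.vander(x_e, n, increasing=True), y_e, rcond=None)
    return np.sum((y_v - np.vander(x_v, n, increasing=True) @ theta) ** 2)

scores = {n: cv_score(n) for n in range(1, 11)}
best_n = min(scores, key=scores.get)   # order with the best prediction score
```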
To evaluate the performance of models of different complexity, we compute the fit measure
Note that \(\mathscr {F}=100\) means a perfect agreement between g(x) and the corresponding estimate. The model fits for \(n=1,\ldots ,15\) are shown in Fig. 3.4: the order \(n=3\) gives the best prediction. Figure 3.5 plots the estimates of g(x) for \(n=3,7,15\) over the \(x_i\), \(i=1,\ldots ,40\). Overfitting occurs when \(n=15\), indicating that the corresponding model is too flexible and fooled by the noise. \(\square \)
3.3 Ill-Conditioning
3.3.1 Ill-Conditioned Least Squares Problems
When \(\varPhi \in {\mathbb R}^{N\times n}\) with \(N\ge n\) is rank deficient, i.e., \({{\,\mathrm{rank}\,}}( \varPhi ) < n\), or “close” to rank deficient, the corresponding LS problem is said to be ill-conditioned. Examples were already encountered in Sect. 1.1.2 to discuss some limitations of the James–Stein estimators and in Sect. 1.2 in the context of FIR models. There are different ways to handle ill-conditioned LS problems. Below, we show how to calculate \(\hat{\theta }^{\text {LS}}\) more accurately by using the singular value decomposition (SVD).
3.3.1.1 Singular Value Decomposition
SVD is a fundamental matrix decomposition. Any matrix \(\varPhi \in {\mathbb R}^{N\times n}\), with \(N\ge n\) to simplify the exposition, can be decomposed as follows:
where \(\varLambda \) is a rectangular diagonal matrix with nonnegative diagonal entries \(\sigma _i\), \(i=1,\ldots ,n\) and \(U\in {\mathbb R}^{N\times N}\) and \(V\in {\mathbb R}^{n\times n}\) are orthogonal matrices, i.e., such that \(U^TU=UU^T=I_N\) and \(V^TV=VV^T=I_n\). The factorization (3.29) is called the singular value decomposition of \(\varPhi \) and the \(\sigma _i\) are called the singular values of \(\varPhi \). Without loss of generality, they can be assumed to be ordered according to their magnitude:
Since \(\varPhi ^T \varPhi = V \varLambda ^T \varLambda V^T = V D^2 V^T\), where D is a square diagonal matrix whose diagonal entries are the \(\sigma _i\), it follows that
where \(\lambda _i(A)\) denotes the ith eigenvalue of the matrix A.
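The identity \(\lambda _i(\varPhi ^T\varPhi )=\sigma _i^2\) is easy to check numerically; the matrix below is an arbitrary example of ours:

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = rng.standard_normal((8, 3))          # arbitrary N >= n example matrix

U, s, Vt = np.linalg.svd(Phi)              # Phi = U Lambda V^T; s holds sigma_1 >= ... >= sigma_n
eigs = np.linalg.eigvalsh(Phi.T @ Phi)     # eigenvalues of Phi^T Phi, ascending order

# The eigenvalues of Phi^T Phi are the squared singular values of Phi.
assert np.allclose(np.sort(s**2), eigs)
```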
3.3.1.2 Condition Number
The condition number of a matrix is a measure of how “close” the matrix is to being rank deficient. When \(\varPhi \) is an invertible square matrix, it is denoted by \({{\,\mathrm{cond}\,}}(\varPhi )\) below and defined as
where \(\Vert \cdot \Vert \) is a matrix norm, with the convention that \({{\,\mathrm{cond}\,}}(\varPhi )=\infty \) for singular \(\varPhi \). For a generic \(\varPhi \in {\mathbb R}^{N\times n}\), with SVD in the form (3.29), its condition number with respect to the 2norm \(\Vert \cdot \Vert _2\) is defined as
where \(\sigma _{\text {max}}=\sigma _1\) and \(\sigma _{\text {min}}=\sigma _n\) are the largest and smallest singular values of \(\varPhi \), respectively. If we use the 2-norm \(\Vert \cdot \Vert _2\) in (3.31), then (3.31) coincides with (3.32). Hereafter, the condition number of a matrix will be defined by (3.32).
3.3.1.3 Ill-Conditioned Matrix and LS Problem
The condition number of a matrix is important since it can be used to measure the sensitivity of the LS estimate to perturbations in the data. To be specific, let \(\varPhi \in {\mathbb R}^{N\times n}\) be full rank and let \(\delta Y\) denote a small componentwise perturbation in Y. The solution of the perturbed LS criterion becomes
Then, it can be shown, e.g., [17, Chap. 5], [10, Chap. 3], that
So, the relative error bound depends on \({{\,\mathrm{cond}\,}}(\varPhi )\): the larger \({{\,\mathrm{cond}\,}}(\varPhi )\), the larger the relative error. One can thus say that a matrix \(\varPhi \) (and the corresponding LS problem) with a small condition number is well conditioned, while a matrix \(\varPhi \) (and the corresponding LS problem) with a large condition number is ill-conditioned. The condition number enters also more complex bounds on the relative error due to perturbations of the matrix \(\varPhi \) [10, 17].
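This error amplification is easy to observe numerically. The sketch below uses a Vandermonde matrix of our own choosing (not the book's example) whose condition number is large, and compares the relative sizes of an output perturbation and of the resulting change in the estimate:

```python
import numpy as np

rng = np.random.default_rng(5)

# An ill-conditioned regression matrix: Vandermonde columns on [0, 1].
x = np.linspace(0.0, 1.0, 12)
Phi = np.vander(x, 8, increasing=True)
theta0 = np.ones(8)
Y = Phi @ theta0                           # noise-free outputs

theta_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

# Perturb Y slightly and re-solve: the relative error in theta can be
# amplified by a factor up to cond(Phi), as in the bound discussed above.
dY = 1e-8 * rng.standard_normal(Y.size)
theta_pert, *_ = np.linalg.lstsq(Phi, Y + dY, rcond=None)

rel_in = np.linalg.norm(dY) / np.linalg.norm(Y)
rel_out = np.linalg.norm(theta_pert - theta_hat) / np.linalg.norm(theta_hat)
```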
Example 3.3
(Effect of illconditioning on LS) Consider the linear regression model (3.13). Let
The two singular values of \(\varPhi \) are \(\sigma _{\text {max}}=1\) and \(\sigma _{\text {min}}=5\times 10^{-9}\), implying that \({{\,\mathrm{cond}\,}}(\varPhi )=2\times 10^{8}\). Thus, \(\varPhi \) and the LS problem (3.14) are ill-conditioned.
Using the normal Eq. (3.15), we obtain the LS estimate \(\hat{\theta }^{LS }_1\) in closed form:
Now, suppose that there is a small perturbation \(\delta Y\) in Y
Solving the normal Eq. (3.15) with Y replaced by \(Y+\delta Y\) now gives
So, when the LS problem (3.14) is ill-conditioned, a small perturbation in Y can cause a significant change in the LS estimate derived by solving the normal Eq. (3.15) directly. \(\square \)
Example 3.4
(Polynomial regression: ill-conditioned LS problem) We revisit the polynomial regression example, see (3.26) and (3.27), stressing the dependence of the condition number on the polynomial complexity. In particular, Fig. 3.6 shows that the ill-conditioning of the regression matrix \(\varPhi \) constructed according to (3.4) and (3.12) worsens as the dimension n increases. This further points out the importance of a careful selection of the discrete model order to control the estimator’s variance when using LS. \(\square \)
3.3.1.4 LS Estimate Exploiting the SVD of \(\varPhi \)
In order to obtain more accurate LS estimates for ill-conditioned problems, one can use the SVD of \(\varPhi \). Given \(\varPhi \in {\mathbb R}^{N\times n}\) with \(N\ge n\), we consider two cases:

– \(\varPhi \) is rank deficient, i.e., \({{\,\mathrm{rank}\,}}(\varPhi )<n\);

– \(\varPhi \) is full rank but has a very large condition number, i.e., \({{\,\mathrm{rank}\,}}(\varPhi )=n\) but \({{\,\mathrm{cond}\,}}(\varPhi )\) is very large.
For the rank-deficient case, we assume without loss of generality that \({{\,\mathrm{rank}\,}}(\varPhi )=m<n\). In this case, the LS problem does not have a unique solution. To obtain a particular solution, we have to impose extra conditions on the solutions of the LS problem.
Let the singular value decomposition of \(\varPhi \) be
where \(\varLambda _1\in {\mathbb R}^{m \times m}\) is diagonal and positive definite while \(U_1 \in {\mathbb R}^{N\times m}\) and \(V_1 \in {\mathbb R}^{n\times m}\).
We now perform a change of coordinates in both the output and parameter space
Note that both \(\tilde{Y}_1\) and \(\tilde{\theta }_1\) are mdimensional vectors. In the new coordinates, the residual vector is
The LS criterion can be rewritten as
and is minimized by
where \(\tilde{\theta }_2 \in {\mathbb R}^{nm}\) is an arbitrary vector. To get the minimum norm solution, one can set \(\tilde{\theta }_2=0\) that, turning back to the original coordinates, yields
Interestingly, for the rank-deficient case, the special solution (3.41) relates to the Moore–Penrose pseudoinverse of \(\varPhi \), defined as
So, given a rectangular diagonal matrix \(\varSigma \), its pseudoinverse \(\varSigma ^+\) is obtained by replacing all the nonzero diagonal entries by their reciprocals and transposing the resulting matrix. When \({{\,\mathrm{rank}\,}}(\varPhi )=n\), the pseudoinverse returns the usual (unique) LS solution
It follows that the minimum norm solution among the general solutions of the LS problem (3.14) can be always written as
For the rank-deficient case, due to round-off errors, \(\varPhi \) may have some very small computed singular values other than the m singular values contained in \(\varLambda _1\) in (3.39). The situation is similar to the case where \(\varPhi \) is full rank but with a very large condition number. Note also that the rank of \(\varPhi \) needs to be known beforehand to compute the SVD of \(\varPhi \). However, the numerical determination of the rank of a matrix is nontrivial (and outside the scope of this book). Here, we just mention a simple way to deal with these issues by using the so-called truncated SVD.
Consider the SVD (3.39) and, without loss of generality, assume
Now set \(\hat{\sigma }_i=\sigma _i\) if \(\sigma _i>tol \) and \(\hat{\sigma }_i=0\) otherwise. Then
where \(\hat{ \varLambda } \in {\mathbb R}^{N\times n}\) is diagonal with entries \(\hat{ \sigma }_1,\hat{ \sigma }_2,\ldots , \hat{ \sigma }_n\), is called the truncated SVD of \(\varPhi \). So, the truncated SVD (3.42) can be used to handle the case where \(\varPhi \) has full rank but large condition number: for a given tol, it suffices to replace \(\varPhi \) with \(\hat{ \varPhi }\) and then to compute the LS estimate of \(\theta \) by means of \(\hat{ \varPhi }^+ Y\).
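A minimal sketch of this recipe follows; the system below is a hypothetical nearly rank-deficient example of ours, and the helper name `tsvd_solve` is our own:

```python
import numpy as np

def tsvd_solve(Phi, Y, tol):
    """Minimum-norm LS solution via truncated SVD:
    singular values not exceeding tol are treated as zero."""
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    s_inv = np.array([1.0 / si if si > tol else 0.0 for si in s])
    return Vt.T @ (s_inv * (U.T @ Y))

# Hypothetical nearly rank-deficient system (the numbers are ours, not the book's).
Phi = np.array([[1.0, 1.0],
                [1.0, 1.0 + 1e-10]])
Y = np.array([2.0, 2.0])

theta = tsvd_solve(Phi, Y, tol=1e-7)   # truncation keeps only the dominant direction
```

Here the second singular value is of order \(10^{-10}\) and is discarded, so the returned estimate is the minimum-norm solution of the rank-one truncated problem.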
Example 3.5
(Truncated SVD) We revisit Example 3.3 by making use of the truncated SVD of \(\varPhi \). We take the user-supplied measure of uncertainty tol to be \(10^{-7}\). Then the LS estimate \(\hat{\theta }^{LS }_3\) computed by (3.41) with Y replaced by \(Y+\delta Y\) becomes
One can thus see that the estimate is now very close to \([1 \ 1 ]^T\), which is the one obtained in the absence of the perturbation \(\delta Y\). \(\square \)
3.3.2 Ill-Conditioning in System Identification
In Sect. 1.2 we have illustrated an ill-conditioned system identification problem. Below, we will see that the difficulty was due to the fact that low-pass filtered inputs may induce regression matrices with large \(\text {cond}(\varPhi )\).
Consider the FIR model of order n:
which can be written in the form (3.13) as follows:
Then we have
Since \(\text {cond}(\varPhi ^T\varPhi )=(\text {cond}(\varPhi ))^2\), we study \(\text {cond}(\varPhi ^T\varPhi )\) in what follows. In addition, while so far we have assumed deterministic regressors, now we work in a more structured probabilistic framework where the system input is a stochastic process. This implies that \(\varPhi \) is a random matrix. In particular, u(t) is filtered white noise, with the filter assumed to be stable and given by
Hence,
where v(t) is zero-mean white noise of variance \(\sigma ^2\) with bounded fourth moments. It follows that u(t) is a zero-mean stationary stochastic process with covariance function \(k_u(t,s)={\mathscr {E}}[u(t)u(s)]=R_u(t-s)\), with \(R_u(\tau )\) defined as follows:
From the ergodic theory, e.g., [25, Theorem 3.4], it also follows that
From (3.46) and (3.48), one obtains the following almost sure convergence:
So, \(\lim _{N\rightarrow \infty }\frac{1}{N}\varPhi ^T\varPhi \) is the covariance matrix of \(\left[ \begin{array}{ccc} u(1) & \ldots & u(n) \end{array} \right] ^T \), whose condition number thus provides insights on the ill-conditioning affecting the system identification problem.
Since the covariance matrix is real and symmetric, its condition number is the ratio between the largest and the smallest of its eigenvalues. An important result of O. Toeplitz, e.g., [44], [20, Chap. 5], says that, as \(n\rightarrow \infty \), the eigenvalues of the covariance matrix of the infinite-dimensional vector \(\left[ \begin{array}{ccc} u(1) & u(2) & \ldots \end{array} \right] ^T\) coincide with the set of values assumed by the power spectrum of u(t), which is given by
Hence, considering also that \(\varPsi _u(-\omega )=\varPsi _u(\omega )\), one has
In addition, since u(t) is a filtered white noise (3.47) and H(q) is stable, one also has [see, e.g., [25, p. 37] for details]:
where \(H(e^{i\omega })\) is the frequency function of the filter H(q), i.e.,
Finally, combining the results (3.49)–(3.53) yields
When the maximum of \(|H(e^{i\omega })|\) is significantly larger than its minimum, the matrix \(\lim _{n\rightarrow \infty }\lim _{N\rightarrow \infty }\frac{1}{N}\varPhi ^T\varPhi \) can be very ill-conditioned. For instance, if we consider the stable filter
then one has
As a varies from 0.01 to 0.99, the input power becomes more concentrated at low frequencies and the ill-conditioning affecting the system identification problem worsens. In fact, the above quantity increases from about 1 to \(1.6\times 10^9\).
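The mechanism can be sketched numerically. As a stand-in for the filter above we use a first-order AR process \(u(t)=a\,u(t-1)+v(t)\), i.e., \(H(q)=1/(1-aq^{-1})\), whose covariance function is \(R_u(\tau )=\sigma ^2 a^{|\tau |}/(1-a^2)\); the book's specific filter and numbers may differ, so no attempt is made to reproduce them:

```python
import numpy as np

def input_cov_cond(a, n, sigma2=1.0):
    """Condition number of the n x n Toeplitz covariance matrix of an AR(1)
    input u(t) = a*u(t-1) + v(t), with R_u(tau) = sigma2 * a^|tau| / (1 - a^2)."""
    lags = np.arange(n)
    R = sigma2 * a**lags / (1.0 - a**2)
    cov = R[np.abs(lags[:, None] - lags[None, :])]   # Toeplitz covariance matrix
    return np.linalg.cond(cov)

# The closer the pole a is to 1, the more low-pass the input and the worse
# the conditioning of (1/N) Phi^T Phi in the limit.
conds = [input_cov_cond(a, n=20) for a in (0.1, 0.5, 0.9, 0.99)]
```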
3.4 Regularized Least Squares with Quadratic Penalties
One way to handle ill-conditioning is to use regularized least squares (ReLS). This method will play a special role in this book to control overfitting by encoding prior knowledge. First insights on these aspects are provided below.
ReLS adds a regularization term \(J(\theta )\) to the LS criterion (3.14), yielding the following problem:
where \(\gamma \ge 0\) is often called the regularization parameter. It balances the adherence to the data \(\Vert Y-\varPhi \theta \Vert _2^2\) and the penalty \(J(\theta )\). There are many choices for the regularization term, which can be connected with the prior knowledge on the true model parameter \(\theta _0\) that needs to be estimated.
In this section, we consider regularization terms \(J(\theta )\) which are quadratic functions of \(\theta \). The resulting estimator will be denoted by ReLSQ in this chapter. In particular, we let \(J(\theta )=\theta ^TP^{-1}\theta \) so that the ReLS criterion (3.57) becomes
where \(P\in {\mathbb R}^{n\times n}\) is a positive semidefinite matrix, here assumed invertible, often called the regularization matrix, and \(I_n\) is the n-dimensional identity matrix.^{Footnote 2}
Remark 3.1
The regularization matrix P could be singular. In this case, (3.58a) is not well defined but, with a suitable arrangement, we can use the Moore–Penrose pseudoinverse \(P^+\) instead of \(P^{1}\). In particular, let the SVD of P be
where \(\varLambda _P\) is a diagonal matrix with the positive singular values of P as diagonal elements and \(U=\begin{bmatrix}U_1&U_2\end{bmatrix}\) is an orthogonal matrix with \(U_1\) having the same number of columns as \(\varLambda _P\). Recall also that \(P^+= U_1\varLambda _P^{-1}U_1^T\). In order to find how (3.58a) should be modified for singular P, let us consider
By replacing P with \(P_{\varepsilon }\) in (3.58a), we obtain
If we let \(\varepsilon \rightarrow 0\), it follows that the parameter vector must satisfy \(U_2^T\theta =0\). Therefore, we may conveniently associate to a singular P the modified regularization problem
If \(P^{-1}\) is replaced by \(P^+\), it is easy to verify that (3.58c) or (3.58d) is still the optimal solution of (3.60). Instead, this does not hold for (3.58b). For convenience, we will use (3.58a) in the sequel and refer to (3.60) for its rigorous meaning.
3.4.1 Making an Ill-Conditioned LS Problem Well Conditioned
ReLSQ can make an ill-conditioned LS problem well conditioned. Consider ridge regression which, as discussed in Sect. 1.2, corresponds to setting \(P=I_n\), hence obtaining
The parameter \(\gamma \) directly affects the condition number of \((\varPhi ^T\varPhi +\gamma I_{n})\) whose inverse defines the regularized estimate. In fact, the positive definite square matrix \((\varPhi ^T\varPhi +\gamma I_{n})\) has eigenvalues (coincident with its singular values) equal to \(\sigma _i^2 + \gamma \). Therefore,
which can be adjusted by tuning the regularization parameter \(\gamma \). This means that regularization can make the LS problem well conditioned even when \(\varPhi \) is rank deficient: if the smallest singular value is null one has
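A quick numerical sketch of this effect, using a rank-deficient matrix of our own construction (the identity \({{\,\mathrm{cond}\,}}=(\sigma _{\text {max}}^2+\gamma )/\gamma \) below is the \(\sigma _{\text {min}}=0\) case of the general formula \((\sigma _{\text {max}}^2+\gamma )/(\sigma _{\text {min}}^2+\gamma )\)):

```python
import numpy as np

# Hypothetical rank-deficient regression matrix: the second column is twice
# the first, so Phi^T Phi is singular and plain LS is ill-conditioned.
Phi = np.array([[1.0, 2.0],
                [2.0, 4.0],
                [3.0, 6.0]])

gamma = 0.1
M = Phi.T @ Phi + gamma * np.eye(2)        # regularized matrix of ridge regression

# Its eigenvalues are sigma_i^2 + gamma, so with sigma_min = 0 the condition
# number is (sigma_max^2 + gamma) / gamma: finite, and tunable via gamma.
s = np.linalg.svd(Phi, compute_uv=False)
cond_M = np.linalg.cond(M)
```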
3.4.1.1 Mean Squared Error
Simple calculations of expectations with respect to the errors \(e_i\), with the regressors \(\phi _i\) assumed to be deterministic, lead to
where \(\text {Cov}(\hat{\theta }^{\text {R}},\hat{\theta }^{\text {R}})\) is the covariance matrix of \(\hat{\theta }^{\text {R}}\) and \(\text {MSE}(\hat{\theta }^{\text {R}},\theta _0) \) is the MSE matrix of \(\hat{\theta }^{\text {R}}\) as a function of the true model parameter \(\theta _0\). Expression (3.62) clearly shows the influence of regularization on the statistical properties of \(\hat{\theta }^{\text {R}}\):

– when \(\gamma =0\), i.e., there is no regularization, \(\hat{\theta }^{\text {R}}\) reduces to \(\hat{\theta }^{\text {LS}}\) and \(\text {MSE}(\hat{\theta }^{\text {R}},\theta _0)\) reduces to \(\sigma ^2(\varPhi ^T\varPhi )^{-1}\);

– when \(\gamma >0\), the regularized estimator \(\hat{\theta }^{\text {R}}\) is biased and the MSE matrix of \(\hat{\theta }^{\text {R}}\) is decomposed into two components: the bias \(\hat{\theta }_{\text {bias}}^{\text {R}}(\hat{\theta }_{\text {bias}}^{\text {R}})^T\) and the variance \(\text {Cov}(\hat{\theta }^{\text {R}},\hat{\theta }^{\text {R}})\). By a suitable choice of the regularization matrix P and the regularization parameter \(\gamma \), the variance of \(\hat{\theta }^{\text {R}}\) can be made “smaller” and, if the resulting increase in the bias is moderate, an MSE matrix “smaller” than that associated with LS can be obtained.
3.4.2 Equivalent Degrees of Freedom
For a given regularization matrix P, we have seen (also deriving the structure of the MSE) that the regularization parameter \(\gamma \) controls the influence of the regularization: as \(\gamma \) varies from 0 to \(\infty \), the influence of the regularization \(\theta ^TP^{-1}\theta \) becomes stronger. In particular, when \(\gamma =0\) there is no regularization and \(\hat{\theta }^{\text {R}}\) reduces to \(\hat{\theta }^{\text {LS}}\). When \(\gamma =\infty \), the regularization term \(\gamma \theta ^TP^{-1}\theta \) overwhelms the data fit \(\Vert Y-\varPhi \theta \Vert _2^2\) and one has \(\hat{\theta }^{\text {R}}=0\).
Often, it is more convenient to exploit a normalized measure of the influence of the regularization instead of considering directly the value of \(\gamma \). To this goal, we introduce the so-called influence or hat matrix:
Such matrix is important since it connects the measured output Y with the predicted output \(\hat{Y} = \varPhi \hat{\theta }^{\text {R}}\), i.e., one has
It is also important since its trace is indeed a normalized measure of the influence of the regularization. To see this, let \(A=\varPhi P \varPhi ^T\) and consider its SVD
where \(UU^T=I\) and D is a diagonal matrix with nonnegative entries \(d_i^2\). Then,
Since U is orthogonal, one has \({{\,\mathrm{trace}\,}}(UMU^T)={{\,\mathrm{trace}\,}}(M)\), so that
The above equation implies that \({{\,\mathrm{trace}\,}}(H)\) is a monotonically decreasing function of \(\gamma \). It attains its maximum at \(\gamma =0\) and its infimum as \(\gamma \rightarrow \infty \). In particular, for \(\gamma =0\) one has \(\hat{\theta }^{\text {R}}=\hat{\theta }^{\text {LS}}\) and the hat matrix becomes \(H=\varPhi (\varPhi ^T\varPhi )^{-1}\varPhi ^T\), implying that \({{\,\mathrm{trace}\,}}(H)=n\) if \(\varPhi \) is full rank. For \(\gamma \rightarrow \infty \) one instead has \({{\,\mathrm{trace}\,}}(H)\rightarrow 0\). Therefore, it holds that \(0<{{\,\mathrm{trace}\,}}(H)\le n\). Hence, since n is the dimension of \(\theta \), i.e., the number of parameters in the linear regression model, \({{\,\mathrm{trace}\,}}(H)\) can be seen as the counterpart of the number of parameters to be estimated in the LS context. In other words, in the regularized framework \({{\,\mathrm{trace}\,}}(H)\) plays the role of the model order. It thus becomes natural to call it the equivalent degrees of freedom of the ReLSQ estimate \(\hat{\theta }^{\text {R}}\), e.g., [21, Sect. 7.6], [4, p. 559]:
The notation \({{\,\mathrm{dof}\,}}(\gamma )\) will be also used in the book in place of \({{\,\mathrm{dof}\,}}(\hat{\theta }^{\text {R}})\) to stress the dependence of the equivalent degrees of freedom on the regularization parameter.
Example 3.6
(Polynomial regression: ridge regression) As shown in Fig. 3.6, the regression matrix \(\varPhi \) built in the polynomial regression example, see (3.26) and (3.27), is ill-conditioned for large n. Here, we consider the case \(n=16\) (corresponding to a polynomial of order 15), which leads to \({{\,\mathrm{cond}\,}}(\varPhi )=1.49\times 10^{11}\). To illustrate how ridge regression (3.61) can counter the ill-conditioning, let \(\gamma =\gamma _i\), \(i=1,\ldots ,16\), with \(\gamma _1=0.01\), \(\gamma _{16}=0.31\) and \(\gamma _2,\ldots ,\gamma _{15}\) evenly spaced between \(\gamma _1\) and \(\gamma _{16}\). For each \(\gamma _i\), we then compute the corresponding ridge regression estimate (3.61) and plot the 16 estimates \(\hat{g}(x) = \phi (x)^T\hat{\theta }^{\text {R}}\) in Fig. 3.7. The fits (3.28) are shown in Fig. 3.8 as a function of \(\gamma \). One can see that \(\gamma =0.11\) gives the best performance, obtaining a fit around \(89\%\). Interestingly, such a fit is larger than the best result obtained by LS through optimal tuning of the discrete model order, see Fig. 3.4. The base-10 logarithm of the condition number of \(\varPhi ^T\varPhi +\gamma I_n\), as a function of \(\gamma \), is displayed in Fig. 3.9. One can see that the matrix is much better conditioned now. Figure 3.10 plots the equivalent degrees of freedom of \(\hat{\theta }^{\text {R}}\). Even if \(n=16\), the actual model complexity in terms of equivalent degrees of freedom is much smaller, around 4 for the tested values of \(\gamma \). Finally, the estimates of each component of \(\theta \) obtained using the different values of \(\gamma \) are shown in Fig. 3.11.
\(\square \)
3.4.2.1 Regularization Design: The Optimal Regularizer
A natural question is how to design a regularization matrix P and select \(\gamma \) to obtain a “good” model estimate. From a “classic” or “frequentist” point of view, rational choices are those that make the MSE matrix (3.62d) small in some sense, as discussed below. For our purposes, it is useful to rewrite the MSE matrix (3.62d) as follows:
Then, it is useful to first introduce the following lemma.
Lemma 3.1
(based on [9]) Consider the matrix
where Q, R and Z are positive semidefinite matrices. Then for all Q
which means that \(M(Q) - M(Z)\) is positive semidefinite.
The proof consists of straightforward calculations and can be found in Sect. 3.8.2.
Using (3.66) and Lemma 3.1, the question of which P and \(\gamma \) give the best MSE of \(\hat{\theta }^{\text {R}}\) has a clear answer: the equation \(\sigma ^2 P=\gamma \theta _0\theta _0^T\) needs to be satisfied. Thus, the following result holds.
Proposition 3.1
(Optimal regularization for a given \(\theta _0\), based on [9]) Letting \(\gamma =\sigma ^2\), the regularization matrix
minimizes the MSE matrix (3.66) in the sense of (3.67).
Note that the MSE matrix (3.66) is linear in \(\theta _0\theta _0^T\). This means that if we compute \(\hat{\theta }^{\text {R}}\) with the same P for a collection of true systems \(\theta _0\), the average MSE over that collection will be given by (3.66) with \(\theta _0\theta _0^T\) replaced by its average over the collection. In particular, if \(\theta _0\) is a random vector with \({\mathscr {E}}(\theta _0\theta _0^T)=\varPi \), we obtain the following result.
Proposition 3.2
(Optimal regularization for a random system \(\theta _0\), based on [9]) Consider (3.62d) with \(\gamma =\sigma ^2\). Then, the best average (expected) MSE for a random true system \(\theta _0\) with \({\mathscr {E}}(\theta _0\theta _0^T)=\varPi \) is obtained by the regularization matrix \(P=\varPi \).
Propositions 3.1 and 3.2 thus give a somewhat preliminary answer to our design problem. Since the best regularization matrix \(P=\theta _0\theta _0^T\) depends on the true system \(\theta _0\), this formula cannot be used in practice. Nevertheless, it suggests choosing a regularization matrix which mimics the behaviour of \(\theta _0\theta _0^T\). Using prior knowledge of the true system \(\theta _0\), this can be done by postulating a parametrized family of matrices \(P(\eta )\) with \(\eta \in \varGamma \subset {\mathbb R}^m\), where \(\eta \) is the so-called hyperparameter vector, \(\varGamma \) is the set where \(\eta \) can vary and m is the dimension of \(\eta \). Thus, the choice of a parametrized regularization matrix is similar to model structure selection in system identification. The nature of the optimal regularizer also suggests setting
However, the noise variance \(\sigma ^2\) is in general unknown and needs to be estimated from the data. One can adopt equations (3.22) or (3.23). Another option is to include \(\sigma ^2\) in \(\eta \) and then estimate it together with the other hyperparameters.
3.5 Regularization Tuning for Quadratic Penalties
3.5.1 Mean Squared Error and Expected Validation Error
Now, assume that a parametrized family of regularization matrices \(P(\eta )\) has been defined. The vector \(\eta \) is in general unknown and has to be tuned by using the available measurements. The ReLSQ estimate \(\hat{\theta }^{\text {R}}(\eta )\) in (3.58) depends on \(\eta \) and the estimation strategy depends on the measure used to quantify its quality. We will consider the following two criteria:

minimizing the MSE;

minimizing the expected validation error (EVE).
3.5.1.1 Minimizing the MSE
Still adopting a “classic” or “frequentist” point of view, a rational choice of \(\eta \) is one that makes the MSE matrix (3.62d) small in some sense. For ease of estimation, a scalar measure is often exploited. In [25, Chap. 12], it is suggested to use a weighting matrix Q and \({{\,\mathrm{trace}\,}}(\text {MSE}({\hat{\theta }^{\text {R}}(\eta )},\theta _0)Q)\) as a quality measure of \(\hat{\theta }^{\text {R}}(\eta )\), where Q reflects the intended use of the model \(\hat{\theta }^{\text {R}}(\eta )\). Then an estimate of \(\eta \), say \(\hat{\eta }\), is obtained as follows:
Note that (3.70) depends on the true system \(\theta _0\) that is unknown and thus cannot be used. In practice, we need to first find a “good” estimate, say \(\hat{\theta }\), of the true system \(\theta _0\) and then to replace \(\theta _0\) in (3.70) with \(\hat{\theta }\). Then, hopefully, a “good” estimate is given by
Different choices of \(\hat{\theta }\) and Q lead to different estimators (3.71). Examples are obtained by setting \(\hat{\theta }\) to the LS estimate or to the ridge regression estimate of \(\theta _0\), while the choice \(Q=I_n\) is often used. In any case, the major difficulty underlying the idea of “minimizing the MSE” for hyperparameter tuning lies in whether or not \(\hat{\theta }\) is a “good” estimate of \(\theta _0\), which is actually our fundamental problem.
3.5.1.2 Minimizing the EVE
An alternative quality measure of \(\hat{\theta }^{\text {R}}(\eta )\) is related to model prediction capability on independent validation data and is characterized by the expected validation error (EVE).
To define it, we need to introduce the training/estimation data and the validation data. The training data is used for estimating the model and is contained in the set \(\mathscr {D}_\text {T}\). The validation data are used to assess model prediction capability and are in the set \(\mathscr {D}_\text {V}\).
Now, let \(\hat{\theta }^{\text {R}}(\eta )\) denote a general ReLSQ estimate parametrized by the vector \(\eta \) and obtained using only the training data \(\mathscr {D}_\text {T}\). Let \(y_\text {v}\in {\mathbb R}\), \(\phi _{\text {v}}\in {\mathbb R}^n\) be a validation sample pair. These objects could both be random, e.g., \(y_\text {v}\) can be affected by noise and the regressor could be defined by a stochastic system input. The validation error \(\text {EVE}_{\mathscr {D}_\text {T}}(\eta )\) is then given by
In the above equation, the expectation \({\mathscr {E}}\) is computed w.r.t. the joint distribution of \(y_\text {v}\) and \(\phi _{\text {v}}\) conditioned on the training data \(\mathscr {D}_\text {T}\). If \(\phi _{\text {v}}\in {\mathbb R}^n\) is deterministic and, as usual, \(y_\text {v}\) is affected by a noise independent of those entering the training set, the mean is taken just w.r.t. such noise, with \(\mathscr {D}_\text {T}\) influencing only \(\hat{\theta }^{\text {R}}\). In any case, the result is a function of the training set. Now, we can see \(\mathscr {D}_\text {T}\) as random and then the EVE is
where the expectation \({\mathscr {E}}\) is over the training set. Note that the final result is a function of the true \(\theta _0\), which determines the probability distributions of the training and validation data.
The \(\text {EVE}(\eta )\) measures the prediction capability of the model \(\hat{\theta }^{\text {R}}(\eta )\) before seeing any training or validation data: the smaller the \(\text {EVE}(\eta )\), the better the expected model prediction capability. Therefore, it is natural to estimate \(\eta \) as follows:
However, as said, the above objective depends on the unknown vector \(\theta _0\), so that this minimization is not possible in practice. The problem is analogous to that encountered when trying to tune \(\eta \) by minimizing the MSE.
Remark 3.2
Interestingly, the idea of “minimizing the MSE” and the idea of “minimizing the EVE” are connected. To see this, we assume for simplicity that the regressors \(\phi _i\), \(i=1,\ldots ,N\) in the training data and \(\phi _{\text {v}}\) in the validation data are deterministic. Then it can be shown that
where the expectation \({\mathscr {E}}\) is over everything that is random, and \(\text {MSE}(\hat{\theta }^{\text {R}}(\eta ),\theta _0)\) is the MSE matrix of \(\hat{\theta }^{\text {R}}(\eta )\) defined in (3.62d). Clearly, (3.75) shows that minimizing \(\text {EVE}(\eta )\) with respect to \(\eta \) is equivalent to minimizing \({{\,\mathrm{trace}\,}}(\text {MSE}(\hat{\theta }^{\text {R}}(\eta ),\theta _0)Q)\) with respect to \(\eta \) when \(Q=\phi _{\text {v}}\phi _{\text {v}}^T\).
To overcome the fact that the EVE depends on the unknown \(\theta _0\), we could first find a “good” estimate of \(\text {EVE}(\eta )\) using the available data and then determine the hyperparameter vector by minimizing it. There are two ways to achieve this goal: by efficient sample reuse of the data and by considering the in-sample EVE instead. More details will be provided in the next two subsections.
3.5.2 Efficient Sample Reuse
One way to estimate \(\text {EVE}(\eta )\) through efficient sample reuse relies on cross-validation (CV) [41] and its variants, already mentioned in Sects. 2.6.3 and 3.2.2 when discussing model order selection.
3.5.2.1 Hold-Out Cross-Validation
The simplest CV is the so-called hold-out CV (HOCV), which is widely used to select the model order for the classical PEM/ML. The HOCV can also be used to estimate the hyperparameter \(\eta \in \varGamma \) for the ReLSQ method.
The idea of hold-out CV is to first split the given data into two parts: the training data \(\mathscr {D}_\text {T}\) and the validation data \(\mathscr {D}_\text {V}\). The prediction capability is measured in terms of the validation error, and the model that gives the smallest validation error is selected. More specifically, the HOCV takes the following three steps:

(1)
Split the given data into two parts: \(\mathscr {D}_\text {T}\) and \(\mathscr {D}_\text {V}\).

(2)
Estimate the model \(\hat{\theta }^{\text {R}}(\eta )\) based on \(\mathscr {D}_\text {T}\) for different values of \(\eta \in \varGamma \).

(3)
Calculate the validation error for \(\hat{\theta }^{\text {R}}(\eta )\) over the validation data \(\mathscr {D}_\text {V}\):
$$\begin{aligned} \text {CV}(\eta )&= \sum _{(y_\text {v},\phi _\text {v})\in \mathscr {D}_\text {V} } (y_\text {v}-\phi _\text {v}^T\hat{\theta }^{\text {R}}(\eta ))^2, \end{aligned}$$where the summation is over all pairs \((y_\text {v},\phi _\text {v})\) in the validation data \(\mathscr {D}_\text {V}\). Then, select the value of \(\eta \) that minimizes \(\text {CV}(\eta )\):
$$\begin{aligned} \hat{\eta } = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma }\ \text {CV}(\eta ). \end{aligned}$$(3.76)
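The three steps above can be sketched as follows for ridge regression, used here only as a convenient instance of \(\hat{\theta }^{\text {R}}(\eta )\) with scalar hyperparameter \(\eta =\gamma \) (a minimal illustration; the data and the grid of \(\gamma \) values are arbitrary):

```python
import numpy as np

def ridge(Phi, Y, gamma):
    # ReLSQ estimate with quadratic penalty: (Phi^T Phi + gamma I)^{-1} Phi^T Y
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)

def hold_out_cv(Phi, Y, gammas, split=0.5):
    # Step (1): split the data into training and validation parts
    N = len(Y)
    Nt = int(split * N)
    Pt, Yt = Phi[:Nt], Y[:Nt]        # D_T
    Pv, Yv = Phi[Nt:], Y[Nt:]        # D_V
    # Steps (2)-(3): fit on D_T for each gamma, score on D_V
    cv = [np.sum((Yv - Pv @ ridge(Pt, Yt, g))**2) for g in gammas]
    return gammas[int(np.argmin(cv))], cv

rng = np.random.default_rng(1)
Phi = rng.standard_normal((100, 5))
theta0 = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = Phi @ theta0 + 0.1 * rng.standard_normal(100)
best_gamma, cv = hold_out_cv(Phi, Y, np.logspace(-3, 2, 20))
```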
It is also possible to swap the roles of the training and validation sets in order to perform a second validation step: the model is estimated on the previous validation set and the validation error is computed on the previous training set. The final validation error is then obtained by averaging the two validation errors.
3.5.2.2 k-Fold Cross-Validation
The HOCV with swapped sets is a special case of the more general k-fold CV with \(k=2\), e.g., [24]. If the data set size is small, the HOCV may perform poorly. In fact, the training data may not be sufficiently rich to build good models, and a validation set of small size may give a too uncertain validation error. In this case, the k-fold CV with \(k>2\) can be used.
The idea of kfold CV is to first split the data into k parts of equal size. For every \(\eta \in \varGamma \), the following procedure is repeated k times. At the ith run with \(i=1,2,\ldots ,k\):

(1)
Retain the ith part as the validation data \(\mathscr {D}_{\text {V},i}\), and use the remaining \(k-1\) parts as the training data \(\mathscr {D}_{\text {T},i}\).

(2)
Estimate \(\hat{\theta }^{\text {R}}(\eta )\) based on the training data \(\mathscr {D}_{\text {T},i}\) and then calculate the validation error over the validation data \(\mathscr {D}_{\text {V},i}\)
$$\begin{aligned} \text {CV}_{i}(\eta )&= \sum _{(y_\text {v},\phi _\text {v})\in \mathscr {D}_{\text {V},i} } (y_\text {v}-\phi _\text {v}^T\hat{\theta }^{\text {R}}(\eta ))^2, \end{aligned}$$where the summation is over all pairs \((y_\text {v},\phi _\text {v})\) in the validation data \(\mathscr {D}_{\text {V},i}\).
Finally, the k validation errors \(\text {CV}_{i}(\eta )\) so obtained are summed to obtain the following total validation error for \(\eta \):
and the estimate of \(\eta \) is finally given by
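The k-fold procedure can be sketched as follows, again using ridge regression as a concrete instance of \(\hat{\theta }^{\text {R}}(\eta )\) (an illustration with arbitrary data, not the chapter's example):

```python
import numpy as np

def ridge(Phi, Y, gamma):
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)

def kfold_cv(Phi, Y, gamma, k=5):
    """Total k-fold validation error CV(eta) for one hyperparameter value."""
    N = len(Y)
    folds = np.array_split(np.arange(N), k)
    total = 0.0
    for idx in folds:
        train = np.setdiff1d(np.arange(N), idx)
        # (1) fold idx is the validation data, the other k-1 parts train
        theta = ridge(Phi[train], Y[train], gamma)        # (2) fit on D_{T,i}
        total += np.sum((Y[idx] - Phi[idx] @ theta)**2)   # CV_i(eta)
    return total

rng = np.random.default_rng(3)
Phi = rng.standard_normal((60, 4))
Y = Phi @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.2 * rng.standard_normal(60)
gammas = np.logspace(-3, 2, 15)
eta_hat = gammas[np.argmin([kfold_cv(Phi, Y, g) for g in gammas])]
```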
3.5.2.3 Predicted Residual Error Sum of Squares and Variants
The computation of the k-fold CV is often expensive; an exception is the leave-one-out CV (LOOCV), where each validation set includes only one validation pair. When the square loss function is used, the total validation error admits a closed-form expression and the LOOCV is also known as the predicted residual error sum of squares (PRESS), e.g., [2].
First, recall the linear regression model (3.13) and the corresponding data \(y_i\in {\mathbb R}\) and \(\phi _i\in {\mathbb R}^n\) for \(i=1,\ldots ,N\). Then the ReLSQ estimate is
where we have set \(\gamma =\sigma ^2\) following (3.69). For the kth measured output \(y_k\), the corresponding predicted output \(\hat{y}_k\) and residual \(r_k\) are, respectively,
Then, PRESS selects the value of \(\eta \in \varGamma \) that minimizes the sum of squares of the validation errors. One can prove that this corresponds to the following problem:
where \(r_k\) are defined by (3.79) while
The derivation of (3.80) can be found in Sect. 3.8.3. It is worth noting that the denominator in (3.80) is strictly related to the diagonal entries of the hat matrix H defined in (3.63). In fact,
so that
Hence, interestingly, one can conclude that PRESS evaluation requires computing just the ReLSQ estimate exploiting the full data set (instead of solving N problems, one for each measurement left out of the training set).
One method that is closely related to PRESS is the so-called generalized cross-validation (GCV), e.g., [18]. GCV is obtained by replacing in (3.80) the factors \(H_{kk}\) with their average, i.e., \({{\,\mathrm{trace}\,}}(H)/N\):
Recalling (3.65), the term \({{\,\mathrm{trace}\,}}(H)\) defines the degrees of freedom of \(\hat{\theta }^{\text {R}}\). Hence, the GCV criterion can be rewritten as follows:
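These closed forms can be sketched and checked numerically for ridge regression (an illustration; the GCV score is written up to an overall constant that does not affect its minimizer):

```python
import numpy as np

def hat_matrix(Phi, gamma):
    n = Phi.shape[1]
    return Phi @ np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T)

def press(Phi, Y, gamma):
    """PRESS from a single fit on the full data: sum_k (r_k/(1-H_kk))^2."""
    H = hat_matrix(Phi, gamma)
    r = Y - H @ Y                       # residuals of the full-data fit
    return np.sum((r / (1.0 - np.diag(H)))**2)

def press_brute_force(Phi, Y, gamma):
    """Same score by explicitly solving the N leave-one-out problems."""
    N, n = Phi.shape
    total = 0.0
    for k in range(N):
        m = np.arange(N) != k
        th = np.linalg.solve(Phi[m].T @ Phi[m] + gamma * np.eye(n),
                             Phi[m].T @ Y[m])
        total += (Y[k] - Phi[k] @ th)**2
    return total

def gcv(Phi, Y, gamma):
    """GCV: replace each H_kk in PRESS by the average trace(H)/N."""
    H = hat_matrix(Phi, gamma)
    r = Y - H @ Y
    return np.sum(r**2) / (1.0 - np.trace(H) / len(Y))**2

rng = np.random.default_rng(4)
Phi = rng.standard_normal((30, 5))
Y = Phi @ rng.standard_normal(5) + 0.3 * rng.standard_normal(30)
```

For ridge regression the two PRESS computations agree exactly, which is the content of the closed-form expression (3.80).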
3.5.3 Expected In-Sample Validation Error
In the definition of the validation error \(\text {EVE}_{\mathscr {D}_\text {T}}\) (3.72), reported for convenience also below
we assumed that the conditional expectation \({\mathscr {E}}\) is over the independent validation sample pair \(y_\text {v}\in {\mathbb R}\), \(\phi _{\text {v}}\in {\mathbb R}^n\), drawn randomly from their joint distribution. The computation of the validation error (3.72) becomes easier if the independent validation sample pairs \(y_\text {v}\in {\mathbb R}\), \(\phi _{\text {v}}\in {\mathbb R}^n\) are generated in a particular way.
For linear regression problems, it is convenient to assume that the same deterministic regressors \(\phi _i\), \(i=1,2,\ldots ,N\), are used for generating both the training data and the validation data. To be specific, still using \(\theta _0\) to denote the true parameter vector, we recall from (3.6) that the training output samples are
In this case, the training set is
Using the same regressors \(\phi _i\), consider a set of validation output samples \(y_{\text {v},i}\) as follows:
where \(\theta _0\) is the true parameter vector, with the noises \(e_{i}\) and \(e_{\text {v},i}\) assumed independent and identically distributed. The validation error is now denoted by \({\text {EVE}_{\text {in}}}_{\mathscr {D}_\text {T}}(\eta )\), computed as follows:
and called the in-sample validation error [21, p. 228]. Note that, similarly to what was discussed after (3.72), the expectation \({\mathscr {E}}\) in (3.86) is computed w.r.t. the joint distribution of the couples \(y_{\text {v},i},\phi _i\) conditioned on the training data \(\mathscr {D}_\text {T}\). Thus, the result is a function of the training set. As done in (3.73), we can remove this dependence by computing the expected in-sample validation error as
with the expectation taken over the joint distribution of the training data. In what follows, we will see how to build an unbiased estimator of \(\text {EVE}_{\text {in}}(\eta )\) using the training data (3.84), and how to exploit it for hyperparameter tuning.
3.5.3.1 Expectation of the Sum of Squared Residuals, Optimism and Degrees of Freedom
To estimate \(\text {EVE}_{\text {in}}(\eta )\), consider the sum of squared residuals
which is function only of the training set. Its expectation w.r.t. the training data (3.84) is
One expects \(\text {EVE}_{\text {in}}(\eta )\) to be not smaller than \(\overline{\text {err}}(\eta )\), because the latter quantity exploits the same data to fit the model and to assess the error. This intuition is indeed correct, as shown in the following theorem whose proof is in Sect. 3.8.4.
Theorem 3.7
Consider the linear regression model (3.13) with the training data (3.84), the validation data (3.85) and the ReLSQ estimate (3.58). Then it holds that
Theorem 3.7 shows that the expectation of the sum of squares of the residuals is an overly optimistic estimator of the expected insample validation error \(\text {EVE}_{\text {in}}(\eta )\). The difference between \(\text {EVE}_{\text {in}}(\eta )\) and \(\overline{\text {err}}(\eta )\) is called the optimism in statistics. In particular, one has, see, e.g., [21, p. 229]:
where rewriting (3.83) as
and defining the output prediction as
it holds that
Combining arguments contained in the proof of Theorem 3.7 reported in the appendix to this chapter, see, in particular, (3.164), with the definition of equivalent degrees of freedom in (3.65), one obtains that
This thus reveals the deep connection between the optimism and the equivalent degrees of freedom.
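This connection can be verified by Monte Carlo simulation. The sketch below (an illustration only; a 1/N normalization of the errors and ridge regression are assumed) repeatedly draws training and validation outputs sharing the same regressors, and compares the average gap between the in-sample validation error and the training error with \(2\sigma ^2{{\,\mathrm{trace}\,}}(H)/N\):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, sigma, gamma = 50, 8, 0.5, 1.0
Phi = rng.standard_normal((N, n))
theta0 = rng.standard_normal(n)
mu = Phi @ theta0                                    # noiseless outputs
H = Phi @ np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T)

M = 20000
err_bar, eve_in = 0.0, 0.0
for _ in range(M):
    Y = mu + sigma * rng.standard_normal(N)          # training data (3.84)
    Yv = mu + sigma * rng.standard_normal(N)         # validation data (3.85)
    Yhat = H @ Y                                     # ridge predictions
    err_bar += np.sum((Y - Yhat)**2) / N
    eve_in += np.sum((Yv - Yhat)**2) / N

optimism = (eve_in - err_bar) / M
theory = 2 * sigma**2 * np.trace(H) / N
print(optimism, theory)   # the two quantities agree closely
```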
3.5.3.2 An Unbiased Estimator of the Expected In-Sample Validation Error
Exploiting (3.94), we can now rewrite (3.91) as
Interestingly, on the left-hand side of (3.95), \(\text {EVE}_{\text {in}}(\eta )\), by definition (3.87), is the mean of a random variable which depends on both the training data (3.84) and the validation data (3.85). Instead, on the right-hand side of (3.95), \(\overline{\text {err}}(\eta )\) is the expectation of a random variable which depends only on the training data. Hence, an unbiased estimator \(\widehat{\text {EVE}_{\text {in}}}(\eta )\) of \(\text {EVE}_{\text {in}}(\eta )\) is obtained by just replacing \(\overline{\text {err}}(\eta )\) with \(\overline{\text {err}}(\eta )_{\mathscr {D}_\text {T}}\) reported in (3.88). One thus obtains
So, after observing the training data (3.84), the hyperparameter \(\eta \) can be estimated as follows:
The hyperparameter estimation criterion (3.97) has different names in statistics: it is known as the \(C_p\) statistic, e.g., [27], and as Stein’s unbiased risk estimator (SURE), e.g., [40].
Interestingly, as will be clear from the proof of Theorem 3.7, the formula (3.97) still provides an unbiased prediction risk estimator even if we replace \(\varPhi \theta _0\) in (3.92) with a generic vector \(\mu \) s.t. \(Y=\mu +E\). Hence, one does not need to assume the existence of a true \(\theta _0\) and of a regression matrix describing the linear input–output relation. A variant of the expected in-sample validation error is also discussed in Sect. 3.8.5.
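For ridge regression with \(\eta =\gamma \), the criterion adds to the training error an optimism correction proportional to the equivalent degrees of freedom. A minimal sketch (assuming a 1/N normalization of the errors, which does not change the minimizer, and a known noise variance):

```python
import numpy as np

def sure_score(Phi, Y, gamma, sigma2):
    """Unbiased estimate of the expected in-sample validation error:
    training error plus the optimism term 2*sigma^2*dof(gamma)/N."""
    N, n = Phi.shape
    H = Phi @ np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T)
    r = Y - H @ Y
    return np.sum(r**2) / N + 2.0 * sigma2 * np.trace(H) / N

rng = np.random.default_rng(5)
Phi = rng.standard_normal((80, 10))
Y = Phi @ rng.standard_normal(10) + 0.3 * rng.standard_normal(80)
gammas = np.logspace(-3, 3, 25)
eta_hat = gammas[np.argmin([sure_score(Phi, Y, g, 0.09) for g in gammas])]
```

Unlike PRESS or k-fold CV, this score uses no data splitting, but it requires an estimate of \(\sigma ^2\).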
3.5.3.3 Excess Degrees of Freedom*
In the previous subsection, we discussed how to construct an unbiased estimator of the expected in-sample validation error, see (3.96), and how to use it for hyperparameter tuning, see (3.97). Irrespective of the particular method adopted for hyperparameter estimation, the estimate \(\hat{\eta }\) of \(\eta \) depends on the data Y, with the regression matrix \(\varPhi \) here assumed deterministic and known. We stress this by writing
Accordingly, the ReLSQ estimate (3.58) with \(\eta \) replaced by \( \hat{\eta }(Y)\) becomes
Since \(\hat{\eta }\) is a random vector, to design a truly unbiased estimator of the expected in-sample validation error of \(\hat{\theta }^{\text {R}}(\hat{\eta }(Y))\) one should not use (3.96), since it assumes that the hyperparameter \(\eta \) is constant.
In what follows, we derive an unbiased estimator of the expected in-sample validation error of \(\hat{\theta }^{\text {R}}(\hat{\eta }(Y))\). Such an estimator will thus also account for the price of estimating model complexity (the degrees of freedom) from data. To this goal, we need the following version of Stein’s Lemma [40], a simplified version of which was already introduced in Chap. 1.
Lemma 3.2
(Stein’s Lemma, adapted from [40]) Consider the following additive measurement model:
where \(\mu \) is an unknown constant vector and \(\varepsilon \sim N(0,\varSigma )\). Let \(\hat{\mu }(x)\) be an estimator of \(\mu \) based on the data x such that \(\text {Cov}(\hat{\mu }(x),x)\) and \({\mathscr {E}}(\frac{\partial \hat{\mu }(x)}{\partial x})\) exist. Then
Let
so that (3.85) can be rewritten as
Now, let us consider the measurement model (3.92) and the validation data (3.100), assuming also that
Then, using the correspondences
together with (3.161) in the appendix to this chapter, one can prove that
Using Stein’s Lemma, one has
Therefore, it holds that
If \(\hat{\eta }=\hat{\eta }(Y)\) were independent of Y, the above objective would coincide with the SURE score reported in (3.97). The difference is the presence of the term \(2\sigma ^2\frac{1}{N}{{\,\mathrm{trace}\,}}({\mathscr {E}}[\frac{\partial f(Y,\hat{\eta })}{\partial \hat{\eta }}\frac{\partial \hat{\eta }}{\partial Y}])\). It represents the extra optimism induced by the estimation of \(\eta \) and is due to the randomness of the data Y entering the hyperparameter estimator. The term \({{\,\mathrm{trace}\,}}({\mathscr {E}}[\frac{\partial f(Y,\hat{\eta })}{\partial \hat{\eta }}\frac{\partial \hat{\eta }}{\partial Y}])\) is called the excess degrees of freedom [33] and is denoted by
From (3.102), we readily obtain an unbiased estimator of \(\text {EVE}_{\text {in}}\) as follows:
where \(\widehat{\text {exdof}(\hat{Y}(\hat{\eta }))}\) is an unbiased estimator of \(\text {exdof}(\hat{Y}(\hat{\eta }))\). As discussed in [33], (3.104) can be used to compare different regularized estimators also in terms of the different complexity of the hyperparameter tuning strategies that they adopt.
3.6 Regularized Least Squares with Other Types of Regularizers \(\star \)
The general ReLS criterion assumes the following form
Different choices of the regularization term \(J(\theta )\) encode different prior knowledge regarding \(\theta _{0}\). Having discussed the quadratic penalty, we now consider two other important choices for \(J(\theta )\): the \(\ell _1\)-norm and the nuclear norm.
3.6.1 \(\ell _1\)-Norm Regularization
ReLS with \(\ell _1\)-norm regularization leads to
where \(\Vert \theta \Vert _1\) represents the \(\ell _1\)-norm of \(\theta \), i.e., \(\Vert \theta \Vert _1=\sum _{i=1}^n|\theta _i|\) with \(\theta _i\) being the ith element of \(\theta \). The problem (3.105) is also known as the least absolute shrinkage and selection operator (LASSO) [42] and is equivalently defined as follows:
where \(\beta \ge 0\) is a tuning parameter connected with \(\gamma \) that controls the sparsity of \(\theta \).
3.6.1.1 Computation of Sparse Solutions
LASSO (3.105) is widely used for finding sparse solutions. In signal processing, this problem has wide applications in compressive sensing for finding sparse signal representations from redundant dictionaries. In machine learning and statistics, it has also been applied extensively for variable selection, where the aim is to select a subset of relevant variables to use in model construction.
Recall that a vector \(\theta \in {\mathbb R}^n\) is said to be sparse if \(\Vert \theta \Vert _0 \ll n\), where \(\Vert \theta \Vert _0\) is the \(\ell _0\)-norm of \(\theta \), which counts the number of nonzero elements of \(\theta \). For linear regression models, sparse estimation requires finding a sparse \(\theta \) able to fit the data well, i.e., such that \(\Vert Y-\varPhi \theta \Vert _2^2\) is small. More formally, the problem is defined as follows:
where \(Y\in {\mathbb R}^N,\theta \in {\mathbb R}^{n}\) with \(n>N\), \(\varPhi \in {\mathbb R}^{N\times n}\) assumed of full rank, i.e., \({{\,\mathrm{rank}\,}}(\varPhi )=N\), and \(\varepsilon \ge 0\) is a tuning parameter that controls the data fit.
The problem (3.107) is known to be NP-hard, e.g., [31]. It is combinatorial and finding its solution requires an exhaustive search. Hence, one needs approximate methods. The most popular technique relies on a convex relaxation of (3.107), obtained by replacing the \(\ell _0\)-norm with the \(\ell _1\)-norm:
By using the method of Lagrange multipliers, it can be shown that the convex relaxation (3.108) is equivalent to LASSO (3.105).
A natural question is whether or not the solution of LASSO (3.105) can be sparse. The answer is affirmative. For illustration, we first show this feature when the regression matrix \(\varPhi \) is orthogonal, assuming \(N=n\).
3.6.1.2 LASSO Using an Orthogonal Regression Matrix
Let us consider (3.105) with orthogonal regression matrix \(\varPhi \), i.e., \(\varPhi ^T\varPhi =\varPhi \varPhi ^T=I_n\). Then (3.105) is rearranged as follows:
where \(\hat{\theta }_i^{\text {LS}}\) is the ith element of \(\hat{\theta }^{\text {LS}}\).
To derive the optimal solution \(\hat{\theta }^\text {R}\), we first recall the definitions of subderivative and subdifferential of a convex function \(f:X\rightarrow {\mathbb R}\), with X an open interval. A subderivative of the convex function \(f:X\rightarrow {\mathbb R}\) at a point \(x_0\) in the open interval X is a real number a such that
for all x in X. It can be shown that there exist b and c with \(b\le c\) such that the set of subderivatives of a convex function at \(x_0\) is a nonempty closed interval \([b,\ c]\), where b and c are the one-sided limits defined as follows:
The closed interval \([b,\ c]\) is called the subdifferential of f(x) at the point \(x_0\).
Then, considering (3.109), \(\hat{\theta }^\text {R}\) is an optimal solution if
where \(\hat{\theta }_i^{\text {R}}\) is the ith element of \(\hat{\theta }^{\text {R}}\) and \(\partial |\hat{\theta }^\text {R}_i|\) represents the subdifferential of \(|\theta _i|\) at \(\hat{\theta }^\text {R}_i\), which is equal to
Using (3.110) and (3.111), we obtain the following explicit solution of LASSO for orthogonal \(\varPhi \):
From (3.112), one can see that the solution of LASSO will be sparse if many elements of \(\hat{\theta }^{\text {LS}}\) have absolute value smaller than \(\gamma /2\). So, \(\gamma \) can be used to tune the sparsity of \(\theta \). It can also be seen that the nonzero elements of the LASSO solution are biased: compared with the LS solution, they are shrunk towards zero (translated towards zero by the constant \(\gamma /2\)).
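The closed-form solution (3.112) is the soft-thresholding operator applied componentwise to the LS estimate. A minimal sketch (the numerical values are arbitrary):

```python
import numpy as np

def soft_threshold(theta_ls, gamma):
    # (3.112): shrink each LS coefficient towards zero by gamma/2,
    # setting it exactly to zero when its magnitude is below gamma/2
    return np.sign(theta_ls) * np.maximum(np.abs(theta_ls) - gamma / 2.0, 0.0)

theta_ls = np.array([3.0, -0.2, 0.7, 0.05])
theta_r = soft_threshold(theta_ls, 1.0)   # two components are set exactly to zero
```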
3.6.1.3 LASSO Using a Generic Regression Matrix: Geometric Interpretation
For a generic nonorthogonal \(\varPhi \), LASSO in general has no explicit solution. To understand why it can still induce sparse solutions, we can use the geometric interpretation of LASSO in the form (3.106) with \(\theta \in {\mathbb R}^2\). In Fig. 3.12, one can see that for the first case, coloured in blue (resp., the third case, coloured in brown), if the elliptical contour is rotated slightly about the axis perpendicular to the paper and through the blue (resp., brown) cross, the optimal solution of (3.106) will still have a zero \(\theta _1\)-element (resp., \(\theta _2\)-element). This explains why LASSO can often induce sparse solutions with a suitable choice of the regularization parameter.
Finally, since the cost function of LASSO (3.105) is a convex function of \(\theta \), many standard convex optimization software packages, such as YALMIP [26], CVX [19], CVXOPT [3] and CVXPY [11], can be used to compute numerical solutions of LASSO very efficiently.
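Besides general-purpose solvers, a simple first-order method for LASSO with generic \(\varPhi \) is proximal gradient descent (ISTA), whose proximal step is exactly the soft-thresholding operator of (3.112). A sketch (an illustration under arbitrary synthetic data, not the chapter's method):

```python
import numpy as np

def lasso_ista(Phi, Y, gamma, n_iter=3000):
    """ISTA for ||Y - Phi theta||_2^2 + gamma*||theta||_1: each iteration is a
    gradient step on the quadratic loss followed by soft thresholding."""
    L = 2.0 * np.linalg.norm(Phi, 2)**2     # Lipschitz constant of the gradient
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = theta - 2.0 * Phi.T @ (Phi @ theta - Y) / L
        theta = np.sign(z) * np.maximum(np.abs(z) - gamma / L, 0.0)
    return theta

rng = np.random.default_rng(6)
Phi = rng.standard_normal((50, 12))
theta0 = np.zeros(12)
theta0[[1, 5]] = [2.0, -3.0]                # sparse true parameter vector
Y = Phi @ theta0 + 0.05 * rng.standard_normal(50)
theta_hat = lasso_ista(Phi, Y, gamma=2.0)   # recovers the sparsity pattern
```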
Example 3.7
(Polynomial regression: LASSO) We revisit the polynomial regression examples, see (3.26) and (3.27), using LASSO (3.105). In particular, we set the model order to \(n=16\), with the regression matrix \(\varPhi \) built according to (3.4) and (3.12). Moreover, we let \(\gamma =\gamma _i\), \(i=1,\ldots ,16\), with \(\gamma _1=0.01\), \(\gamma _{16}=0.31\) and \(\gamma _2,\ldots ,\gamma _{15}\) evenly spaced between \(\gamma _1\) and \(\gamma _{16}\). For each \(\gamma =\gamma _i\), we compute the corresponding solution of LASSO (3.105). In particular, the estimates \(\hat{g}(x) = \phi (x)^T\hat{\theta }^{\text {R}}\) for \(x=x_i\), with \(i=1,\ldots ,40\), are plotted in Fig. 3.13.
The model fits (3.28) obtained for different \(\gamma \) are shown in Fig. 3.14. One can see that \(\gamma =0.15\) gives the best result.
Finally, the LASSO estimates of the components of \(\theta \) obtained using the different values of \(\gamma \) are shown in Fig. 3.15. It is evident that the LASSO estimate (3.105) is sparse. Comparing it with the ridge regression estimates reported in Fig. 3.11, one can conclude that LASSO may give a simpler model, i.e., depending only on a limited number of components of \(\theta \). \(\square \)
3.6.1.4 Sparsity-Inducing Regularizers Beyond the \(\ell _1\)-Norm
We have seen that the \(\ell _1\)-norm plays a key role in sparse estimation. However, as shown in [34], there are many other sparsity-inducing regularizers. Let l be any concave and nondecreasing function on \([0,\ \infty )\); three examples are reported in the top panel of Fig. 3.16. Then, other penalties which promote sparsity assume the form \(J(\theta ) = \sum _{i=1}^n l(\theta _i^2)\) and are given by
Some of them are displayed in the bottom panel of Fig. 3.16. The use of nonconvex penalties may increase the sparsity of the solution, but the drawback is that one must handle optimization problems possibly affected by local minima.
3.6.1.5 Presence of Outliers and Robust Regression
In practical applications, it may happen that the measurement outputs \(y_i\) so far described by the model
may be contaminated by outliers, which represent unexpected deviations from the noise model. They can be due to the failure of some sensors or to mistakes in the setting of the experiment. In this case, data can actually be generated by the following system:
where the \(e_i\) form a white noise with mean zero and variance \(\sigma ^2\), while the \(v_{0,i}\) represent the outliers, which are assumed to be zero most of the time. Hence, the vector
is assumed to be sparse.
When data come from (3.114), straightforward application of the LS method may lead to a poor estimate \(\hat{\theta }^{\text {LS}}\) of \(\theta _0\). For illustration, let us consider an extreme case by assuming \(v_{0,i}=0\) for \(i=1,2,\ldots ,N-1\), while the \(\phi ^T_i\theta _0+e_i\) for \(i=1,\ldots ,N\) are all negligible compared to \(v_{0,N}\). LS leads to
The first \(N-1\) terms in the above cost function are the same encountered in the absence of outliers, while the last term is different due to \(v_{0,N}\). The \(\phi ^T_i\theta _0+e_i\), \(i=1,\ldots ,N\), are negligible compared to \(v_{0,N}\), a phenomenon further amplified by the quadratic criterion adopted here. To make the last term as small as possible, \(\hat{\theta }^{\text {LS}}\) will mainly tend to fit \(v_{0,N}\). Hence, the terms \(\phi ^T_i\theta _0+e_i\), which carry information on the true system, will be largely ignored. This leads to a poor estimate of \(\theta _0\).
Many robust regression methods are available, hinging on loss functions less sensitive to outliers than the square loss. An example is Huber estimation
where the Huber loss function \(l^{\text {Huber}}\) is defined as follows:
In (3.116), \(\gamma >0\) is a tuning parameter whose role will become clear shortly. The Huber loss function (3.116) is less sensitive to outliers because it grows linearly for \(|x|\ge \gamma /2\). Note that a limit case of the Huber loss is the \(\ell _1\)-norm, obtained as \(\gamma \) tends to zero.
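One common parametrization of this loss, with the quadratic and linear pieces (and their derivatives) matched at \(|x|=\gamma /2\), can be sketched as follows (the exact constants of (3.116) are assumed, not quoted):

```python
import numpy as np

def huber_loss(x, gamma):
    # quadratic for |x| < gamma/2, linear with slope gamma beyond;
    # the two pieces join smoothly at |x| = gamma/2
    a = np.abs(x)
    return np.where(a < gamma / 2.0, x**2, gamma * a - gamma**2 / 4.0)
```

For small residuals the loss behaves like the square loss, while a large outlier contributes only linearly, which limits its influence on the fit.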
3.6.1.6 An Equivalence Between \(\ell _1\)-Norm Regularization and Huber Estimation
Let
Consider the \(\ell _1\)-norm regularization given by
whose peculiarity is to require joint optimization w.r.t. the parameter vector \(\theta \) and the outliers \(v_{0,i}\) contained in \(V_0\). Interestingly, (3.117) is actually equivalent to Huber estimation (3.115), i.e., they have the same optimal solution. To show this, one just needs to prove that
The right-hand side of (3.118) corresponds to LASSO (3.105) with an orthogonal regression matrix given by the identity. It thus follows from (3.112) that the components of the optimal solution \(\hat{V}_0^{\text {R}}\) admit the following closed-form expression:
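For readers who wish to experiment, the following Python sketch implements a componentwise soft-thresholding operator of the kind appearing in such closed-form expressions. The exact threshold in (3.119) depends on how \(\gamma \) enters the cost, so the scaling used here (minimizing \((x-u)^2+2t|x|\) componentwise) is illustrative only.

```python
import numpy as np

def soft_threshold(u, t):
    """Componentwise soft-thresholding.

    For each component, this is the closed-form minimizer of
    (x - u)**2 + 2*t*|x|: shrink |u| by t and zero it out below t.
    (The mapping of t to the regularization parameter of (3.119)
    is an assumption of this sketch.)
    """
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

u = np.array([-3.0, -0.2, 0.1, 2.5])
print(soft_threshold(u, 0.5))   # small components are set exactly to zero
```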
Now we replace \(V_0\) in the cost function of the right-hand side of (3.118) with \(\hat{V}_0^{\text {R}}\), and it is straightforward to check that the following identity holds:
Therefore, (3.117) is indeed equivalent to the Huber estimation (3.115).
3.6.2 Nuclear Norm Regularization
So far the output Y, the parameter \(\theta \) and the noise E in (3.13) have been assumed to be vectors. In what follows, we allow them to be matrices and consider the following linear regression model:
The ReLS with nuclear norm regularization takes the following form:
where \(\Vert \cdot \Vert _F\) is the Frobenius norm of a matrix, \(h(\theta )\) is a matrix that is affine in \(\theta \) and \(\Vert h(\theta )\Vert _*\) is the nuclear norm of the matrix \(h(\theta )\); see also Sect. 3.8.1, the appendix to this chapter, for a brief review of matrix and vector norms.
3.6.2.1 Nuclear Norm Regularization for Matrix Rank Minimization
Matrix rank minimization problems (RMP) are a class of optimization problems that involve minimizing the rank of a matrix subject to convex constraints. They are often encountered in signal processing, image processing and statistics. For example, a typical statistical problem is to obtain a low-rank covariance matrix able to describe some available data and/or consistent with some prior assumptions. Formally, the RMP is defined as follows:
with X belonging to a convex set \(\mathfrak C\) while \({{\,\mathrm{rank}\,}}(X)\) describes the order (complexity) of the underlying model.
In general, the RMP (3.123) is NP-hard and thus approximate methods are needed. Several heuristic methods have been proposed, such as the nuclear norm heuristic [14] and the log-det heuristic [15]. In particular, for a convex set \(\mathfrak {C}\) the convex envelope of a function \(f:\mathfrak {C}\rightarrow {\mathbb R}\) is defined as the largest convex function g such that \(g(x)\le f(x)\) for every \(x\in \mathfrak C\), e.g., [22]. For a nonconvex f, solving
may be difficult. In this case, if it is possible to derive the convex envelope g of f, then
turns out to be a convex approximation of (3.124) and, in particular, the minimum of (3.125) provides a lower bound on that of (3.124). Moreover, if necessary, the minimizing argument of (3.125) can be chosen as the initial point for a more complicated nonconvex local search aiming to solve (3.124).
As shown in Theorem 1 of [13, Chap. 5], the convex envelope of the rank function \({{\,\mathrm{rank}\,}}(X)\) with \(X\in \mathfrak {C}=\{X \mid \Vert X\Vert _2\le 1,\ X\in {\mathbb R}^{n\times m}\}\) is the nuclear norm of X, i.e., \(\Vert X\Vert _*\). As a result, the nuclear norm heuristic to solve the RMP (3.123) is obtained by replacing the rank of X with the nuclear norm of X, i.e.,
Without loss of generality, we assume that \(X\in \mathfrak {C}=\{X \mid \Vert X\Vert _2\le M,\ X\in {\mathbb R}^{n\times m}\}\) for some \(M>0\). Then, from the definition of the convex envelope, for \(X\in \mathfrak {C}\) we have
In addition
where \(X^{\text {opt}}\) and \(X^{\text {copt}}\) denote the optimal solution of the RMP (3.123) and that of the nuclear norm heuristic (3.126), respectively. The inequalities in (3.127) thus provide upper and lower bounds for the optimal value of the RMP (3.123).
As shown in [13, Chap. 5], the nuclear norm heuristic (3.126) can be equivalently formulated as a semidefinite program (SDP):
where \(Y\in {\mathbb R}^{n\times n}, Z\in {\mathbb R}^{m\times m}\) and both Y and Z are symmetric. The SDP problem (3.128) can be solved by interior point methods; convex optimization software packages that can be used for this purpose include YALMIP [26], CVX [19], CVXOPT [3] and CVXPY [11].
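A quick numerical sanity check of this SDP formulation can be done even without an SDP solver: the SVD \(X=U\varLambda V^T\) yields the feasible choice \(Y=U\varLambda U^T\), \(Z=V\varLambda V^T\), which attains objective value \(\Vert X\Vert _*\). The following Python sketch (an illustration, not the book's code) verifies feasibility and the attained value; the reverse inequality, i.e., that no feasible point does better, requires duality and is not checked here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Feasible point of the SDP built from the SVD of X
Y = U @ np.diag(s) @ U.T
Z = V @ np.diag(s) @ V.T
block = np.block([[Y, X], [X.T, Z]])   # must be positive semidefinite

nuclear = s.sum()                      # ||X||_* = sum of singular values
print(np.isclose(nuclear, (np.trace(Y) + np.trace(Z)) / 2))  # objective attained
print(np.linalg.eigvalsh(block).min() >= -1e-10)             # PSD check
```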
3.6.2.2 Application in Covariance Matrix Estimation with Low-Rank Structure
Now we go back to the linear regression model (3.121) and the ReLS with nuclear norm regularization (3.122). Consider the problem of covariance matrix estimation with low-rank structure, e.g., [38]. In particular, in (3.121), we take \(N=m=n\), let Y be a sample covariance matrix, \(\varPhi =I_n\), and \(\theta _0\) be a positive semidefinite matrix which has low-rank structure. Moreover, in (3.122), we take \(h(\theta )=\theta \). We can then obtain a matrix estimate \(\hat{\theta }^{\text {R}}\) with low-rank structure using ReLS with nuclear norm regularization as follows:
for a suitable choice of \(\gamma >0\). An example is reported below.
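When \(\varPhi =I_n\) and \(h(\theta )=\theta \), and ignoring any positive semidefiniteness constraint, a problem of this form admits a well-known closed-form solution by singular value soft-thresholding. The following Python sketch illustrates this; the threshold \(\gamma /2\) assumes the cost is written exactly as the squared Frobenius norm plus \(\gamma \) times the nuclear norm, which may differ from (3.129) by a scaling.

```python
import numpy as np

def svt(Y, gamma):
    """Singular value soft-thresholding.

    Closed-form minimizer of ||Y - theta||_F**2 + gamma*||theta||_*:
    shrink each singular value of Y by gamma/2 and zero out the
    singular values below the threshold.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma / 2, 0.0)) @ Vt

Y = np.diag([3.0, 1.0, 0.05])            # a nearly rank-2 "sample covariance"
theta_hat = svt(Y, gamma=0.2)
print(np.linalg.matrix_rank(theta_hat))  # the tiny singular value is zeroed out
```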
Example 3.8
(Covariance matrix estimation problem) First, we construct a block-diagonal rank-deficient covariance matrix \(\theta _0\) that has 4 blocks denoted by \(A_i\in {\mathbb R}^{n_i\times n_i}\) with \(n_1=20\), \(n_2=10\), \(n_3=5\) and \(n_4=15\). Using \(\mathrm{blkdiag}\) to represent a block-diagonal matrix, one thus has \(\theta _0=\mathrm{blkdiag}(A_1,A_2,A_3,A_4)\). Each \(A_i\) is generated by summing up \(v_{i,j}v_{i,j}^T\), \(j=1,\ldots ,n_i-2\), where the \(v_{i,j}\) are \(n_i\)-dimensional vectors with components independent and uniformly distributed on \([-1,\ 1]\). It follows that \({{\,\mathrm{rank}\,}}(\theta _0)=42\) since the rank of the ith block is \(n_i-2\). Then we draw 20000 samples \(x_i\) from the Gaussian distribution \(\mathscr {N}(0,\theta _0)\). The available measurements are \(z_i=x_i+e_i\) where the \(e_i\) are independent and distributed as \(\mathscr {N}(0,0.6)\). Using the \(z_i\) we calculate the sample covariance Y as follows:
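A minimal Python sketch of this data-generating mechanism is reported below. It is an illustration only: the random draws, the reduced sample size and the reading of 0.6 as the noise variance are assumptions, so the numbers differ from those behind Fig. 3.17.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [20, 10, 5, 15]

# Build the block-diagonal rank-deficient covariance theta0:
# each block is a sum of n_i - 2 rank-one terms v v^T.
n = sum(sizes)
theta0 = np.zeros((n, n))
start = 0
for n_i in sizes:
    V = rng.uniform(-1.0, 1.0, size=(n_i, n_i - 2))
    theta0[start:start + n_i, start:start + n_i] = V @ V.T
    start += n_i
print(np.linalg.matrix_rank(theta0))    # 42, since each block has rank n_i - 2

# Noisy samples and their sample covariance (fewer samples than the
# 20000 of the example, for speed; 0.6 taken as the noise variance).
M = 2000
X = rng.multivariate_normal(np.zeros(n), theta0, size=M)
Z = X + rng.normal(0.0, np.sqrt(0.6), size=(M, n))
Y = Z.T @ Z / M
```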
We solve the ReLS problem (3.129) with the data Y defined above and \(\gamma \) in the set \(\{0.1411,0.1414,0.1419,0.1423,0.1427\}\), obtaining different estimates \(\hat{\theta }^{\text {R}}\) of the covariance matrix.
The top panel of Fig. 3.17 shows the base 10 logarithm of the 50 estimated singular values. Each profile is obtained with a different regularization parameter. These results show that, seeing the tiny singular values as null, a suitable value of the regularization parameter, like \(\gamma =0.1427\), leads to \({{\,\mathrm{rank}\,}}(\hat{\theta }^{\text {R}})=42\). Note in fact that the green curve, which is associated with such \(\gamma \), has a jump towards zero when passing from 42 to 43 on the x-axis. The influence of the nuclear norm regularization is also visible in the bottom panel, which shows the profile of the relative error of \(\hat{\theta }^{\text {R}}\) as a function of \(\gamma \). When \(\gamma \) is small, e.g., \(\gamma =0.1411\), the influence is negligible, \(\hat{\theta }^{\text {R}}\) is almost the same as the sample covariance Y and \({{\,\mathrm{rank}\,}}(\hat{\theta }^{\text {R}})=50\). When \(\gamma \) becomes larger, the regularization influence becomes more visible, making \(\hat{\theta }^{\text {R}}\) closer to the true covariance \(\theta _0\). \(\square \)
3.6.2.3 Vector Case: \(\ell _1\)-Norm Regularization
The nuclear norm heuristic and inequalities (3.127) also justify the use of the \(\ell _1\)-norm regularization (3.108) for the problem of finding sparse solutions (3.107).
For the vector case, i.e., \(\theta \in {\mathbb R}^{n\times m}\) with \(m=1\), we can take X and \(\mathfrak {C}\) in the previous section to be \(X=\theta \) and \(\mathfrak {C}=\{\theta \in {\mathbb R}^n \mid \Vert Y-\varPhi \theta \Vert _2^2\le \varepsilon \}\). Then it is easy to see that the \(\ell _1\)-norm is the convex envelope of the \(\ell _0\)-norm for \(\Vert \theta \Vert _\infty \le 1\), i.e.,
Then, the RMP (3.123) and the nuclear norm heuristic (3.126) become the problem of finding sparse solutions (3.107) and the \(\ell _1\)-norm regularization (3.108), respectively. Similarly to the derivation of (3.127), we assume that \(\Vert \theta \Vert _\infty \le M\) for some \(M>0\); then one has
where \(\theta ^{\text {opt}}\) and \(\theta ^{\text {copt}}\) denote the optimal solution of the problem of finding sparse solutions (3.107) and that of the \(\ell _1\)-norm regularization (3.108), respectively. Similar to the matrix case, (3.131) provides upper and lower bounds for the optimal value of the sparse estimation problem (3.107).
3.7 Further Topics and Advanced Reading
The systematic treatment of regression theory is available in many textbooks, e.g., [12, 35]. Noise variance estimation is a critical issue in practical applications and has been discussed in detail in [48]. When the regression matrix is ill-conditioned, it is important to make sure that the least squares estimate is calculated in an accurate and efficient way, e.g., [10, 17]. Moreover, for the regularized least squares in quadratic form, the regularization matrix could also be ill-conditioned. In this case, extra care is required in the calculation of both the regularized least squares estimate and the hyperparameter estimates, e.g., [8]. For given data, the quality of a model depends on the control of its complexity, which can be described by different measures in different contexts, e.g., the model order and the equivalent degrees of freedom. A good exposition of model complexity and its selection can be found in [21]. It is worth mentioning that the degrees of freedom for LASSO have also been defined and discussed in [43, 51]. In practical applications, there are two key issues for the regularized least squares with quadratic regularization: the design of the regularization matrix and the estimation of the hyperparameter. While the latter issue has been discussed extensively in the literature, e.g., [21, 36, 46, 47], there are far fewer results on the former in the context of system identification, as discussed in [7]. The asymptotic properties of some widely used hyperparameter estimators, such as the maximum marginal likelihood estimator, Stein’s unbiased risk estimator and generalized cross-validation, have been reported in [29, 30]. LASSO and its variants have been extremely popular in practical applications, as described in [16, 28, 32, 50]. The nuclear norm heuristic for matrix rank minimization problems has also found wide application in practice, see, e.g., [5, 6, 14, 15, 37].
Beyond the Huber loss function [23], the square loss function can be replaced also by other convex functions like the Vapnik loss function [45] as discussed later on in Chap. 6.
3.8 Appendix
3.8.1 Fundamentals of Linear Algebra
In this section, we review some fundamentals of linear algebra used in this chapter.
3.8.1.1 QR Factorization and Singular Value Decomposition
We begin by giving the definitions of the QR factorization and the SVD, which are very important decompositions used for many purposes beyond solving LS problems.
For any \(\varPhi \in {\mathbb R}^{N\times n}\) with \(N\ge n\), \(\varPhi \) can be decomposed as follows:
where \(Q\in {\mathbb R}^{N\times N}\) is orthogonal, i.e., \(Q^TQ=QQ^T=I_N\), and \(R\in {\mathbb R}^{N\times n}\) is upper triangular. Further assume that \(\varPhi \) has full rank. Then \(\varPhi \) can be decomposed as follows:
where \(Q_1=Q(:,1:n)\) and \(R_1=R(1:n,1:n)\) with Q( : , 1 : n) being the matrix consisting of the first n columns of Q and R(1 : n, 1 : n) being the matrix consisting of the first n rows and n columns of R. The factorizations (3.132) and (3.133) are called the full and thin QR factorization, respectively. In particular, when \(R_1\) has positive diagonal entries, the thin QR factorization (3.133) is unique.
We start providing the “economy size” definition of the SVD. For any \(\varPhi \in {\mathbb R}^{N\times n}\) with \(N\ge n\), \(\varPhi \) can be decomposed as follows:
where \(U\in {\mathbb R}^{N\times n}\) satisfies \(U^TU=I_n\), \(\varLambda = {{\,\mathrm{diag}\,}}(\sigma _1,\sigma _2,\dots ,\sigma _n)\) with \(\sigma _1\ge \sigma _2\ge \dots \ge \sigma _n\ge 0\), and \(V\in {\mathbb R}^{n\times n}\) is orthogonal. The factorization (3.134) is called the singular value decomposition (SVD) of \(\varPhi \) and the \(\sigma _i\), \(i=1,\dots ,n\), are called the singular values of \(\varPhi \).
The SVD admits also the “full size” formulation, as given in (3.29). One has that (3.134) still holds but U is an orthogonal \(N\times N\) matrix and \(\varLambda \) is a rectangular \(N\times n\) diagonal matrix, while V is still an orthogonal \(n\times n\) matrix. In this second formulation, V and U can be associated to orthonormal change of coordinates in the domain and codomain of \(\varPhi \) such that, in the new coordinates, the linear operator is diagonal.
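The two formulations can be compared numerically. The following Python sketch (illustrative only, with an arbitrary random matrix) also highlights that in the economy-size case \(U^TU=I_n\) rather than \(I_N\):

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.standard_normal((5, 3))                      # N = 5 >= n = 3

U, s, Vt = np.linalg.svd(Phi, full_matrices=False)     # economy size: U is 5x3
Uf, sf, Vtf = np.linalg.svd(Phi, full_matrices=True)   # full size: Uf is 5x5

print(U.shape, Uf.shape)                       # (5, 3) (5, 5)
print(np.allclose(U @ np.diag(s) @ Vt, Phi))   # factorization holds
print(np.allclose(U.T @ U, np.eye(3)))         # U^T U = I_n, not I_N
```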
3.8.1.2 Vector and Matrix Norms
Important vector norms are the \(\ell _1\), \(\ell _2\) and \(\ell _\infty \) norms. For a given vector \(\theta \in {\mathbb R}^n\), they are denoted by \(\Vert \theta \Vert _1,\Vert \theta \Vert _2\) and \(\Vert \theta \Vert _\infty \), respectively, and are defined as follows:
where the \(\ell _2\) norm is also known as the Euclidean norm.
Important matrix norms are the nuclear norm, the Frobenius norm and the spectral norm. For a given matrix \(\varPhi \in {\mathbb R}^{N\times n}\) with \(N\ge n\), these three matrix norms are denoted by \(\Vert \varPhi \Vert _*,\Vert \varPhi \Vert _\text {F}\) and \(\Vert \varPhi \Vert _2\), respectively, and are defined as follows:
where \(\sigma _i(\varPhi )\) represents the ith largest singular value of \(\varPhi \), \(\sigma _{\text {max}}(\varPhi )=\sigma _1(\varPhi )\) and \(\varPhi _{i,j}\) is the (i, j)th element of \(\varPhi \).
Now, we report some properties of the vector and matrix norms. The ith largest singular value of \(\varPhi \) is equal to the square root of the ith largest eigenvalue of \(\varPhi ^T\varPhi \), or equivalently \(\varPhi \varPhi ^T\). If \(\varPhi \) is square and positive semidefinite, then the nuclear norm of \(\varPhi \) is equal to the trace of \(\varPhi \), i.e., \(\Vert \varPhi \Vert _*={{\,\mathrm{trace}\,}}(\varPhi )\). For matrices \(A,B\in {\mathbb R}^{N\times n}\), we can define the inner product on \({\mathbb R}^{N\times n}\times {\mathbb R}^{N\times n}\) as \(\langle A,B\rangle ={{\,\mathrm{trace}\,}}(A^TB)=\sum _{i=1}^N\sum _{j=1}^nA_{i,j}B_{i,j}\). So the Frobenius norm is the norm associated with this inner product. The spectral norm is defined as the induced 2norm, i.e., for \(\varPhi \in {\mathbb R}^{N\times n}\),
To show that (3.141) is equal to (3.140), note that \(\max _{\Vert \theta \Vert _2=1} \Vert \varPhi \theta \Vert _2\) is equivalent to \(\max _{\Vert \theta \Vert _2^2=1} \Vert \varPhi \theta \Vert _2^2\), which is further equivalent to
where \(\lambda \) is the Lagrange multiplier. Checking the optimality condition of (3.142) yields that the optimal solution will satisfy
The above equation implies that \(\lambda \) is an eigenvalue of \(\varPhi ^T\varPhi \), and moreover,
As a result, we have
where \(\lambda _{\text {max}}\) is the largest eigenvalue of \(\varPhi ^T\varPhi \) that is equal to \(\sigma _{\text {max}}^2(\varPhi )\). Thus (3.141) is indeed equal to (3.140).
The aforementioned three matrix norms, the nuclear norm, the Frobenius norm and the spectral norm, can be seen as natural extensions of the three vector norms, the \(\ell _1\), \(\ell _2\) and \(\ell _\infty \) norms, respectively. In particular, if we construct an n-dimensional vector with the n singular values of \(\varPhi \) as its elements, then the three matrix norms \(\Vert \varPhi \Vert _*,\Vert \varPhi \Vert _{\text {F}}\) and \(\Vert \varPhi \Vert _2\) correspond to the \(\ell _1\), \(\ell _2\) and \(\ell _\infty \) norms of the constructed vector, respectively. Moreover, for any given norm \(\Vert \cdot \Vert \) on \({\mathbb R}^{N\times n}\), there exists a dual norm \(\Vert \cdot \Vert _\text {d}\) of \(\Vert \cdot \Vert \) defined as
For the vector norms, the dual norm of the \(\ell _1\) norm is the \(\ell _\infty \) norm and the dual norm of the \(\ell _2\) norm is the \(\ell _2\) norm. The properties for the vector norms extend to the matrix norms we have defined: the dual norm of the nuclear norm is the spectral norm, see, e.g., [37], and the dual norm of the Frobenius norm is itself.
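The correspondence between the three matrix norms and the vector norms of the singular values can be checked numerically, as in the following illustrative Python sketch (NumPy exposes all three matrix norms through `np.linalg.norm`):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.standard_normal((5, 3))
s = np.linalg.svd(Phi, compute_uv=False)   # vector of singular values

nuclear   = s.sum()                # ||Phi||_* : l1 norm of s
frobenius = np.sqrt((s**2).sum())  # ||Phi||_F : l2 norm of s
spectral  = s.max()                # ||Phi||_2 : l-infinity norm of s

print(np.isclose(nuclear,   np.linalg.norm(Phi, 'nuc')))   # True
print(np.isclose(frobenius, np.linalg.norm(Phi, 'fro')))   # True
print(np.isclose(spectral,  np.linalg.norm(Phi, 2)))       # True
```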
3.8.1.3 Matrix Inversion Lemma, Based on [49]
The matrix inversion lemma is also known as the Sherman–Morrison–Woodbury formula and refers to the following identity:
\((A+UCV)^{-1}=A^{-1}-A^{-1}U\left(C^{-1}+VA^{-1}U\right)^{-1}VA^{-1}\)
where \(A\in {\mathbb R}^{n\times n}\) and \(C\in {\mathbb R}^{m\times m}\) are nonsingular, while \(U\in {\mathbb R}^{n\times m}\) and \(V\in {\mathbb R}^{m\times n}\).
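The identity is easy to verify numerically. The following Python sketch (illustrative, using randomly generated well-conditioned matrices) checks it:

```python
import numpy as np

# Numerical check of the matrix inversion lemma
# (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
rng = np.random.default_rng(4)
n, m = 5, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # well conditioned
C = np.eye(m)
U = rng.standard_normal((n, m))
V = rng.standard_normal((m, n))

lhs = np.linalg.inv(A + U @ C @ V)
Ai = np.linalg.inv(A)
rhs = Ai - Ai @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ai @ U) @ V @ Ai
print(np.allclose(lhs, rhs))   # True
```

The practical value of the identity is that when \(m \ll n\) the right-hand side only requires inverting small \(m\times m\) matrices once \(A^{-1}\) is available.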
3.8.2 Proof of Lemma 3.1
Define \(W=(QR+I_n)^{-1}\) and \(W_0=(ZR+I_n)^{-1}\). Then (3.67) can be rewritten as
Note that
thus (3.67) can be further rewritten as
In the following, we show that
Simple calculation shows that (3.149) is equivalent to
It follows from the second equation of (3.147) that
Now inserting (3.151) into the lefthand side of (3.150) shows that (3.150) and thus (3.149) holds. Moreover, since \((WW_0)(R^{1}+Z)(WW_0)^T\) in (3.149) is positive semidefinite, Eq. (3.148) holds as well, which in turn implies (3.67) holds. This completes the proof.
3.8.3 Derivation of Predicted Residual Error Sum of Squares (PRESS)
For the case when the kth measured output \(y_k\), \(k=1,\ldots ,N\), is not used, the corresponding ReLSQ estimate becomes
For the kth measured output \(y_k\), \(k=1,\ldots ,N\), the corresponding predicted output \(\hat{y}_{-k}\) and validation error \(r_{-k}\) are
With M defined in (3.81) and by the Woodbury matrix identity, e.g., [10, 17], we have
Then we have
which shows that \(r_{-k}\) is actually obtained by scaling \(r_k\) with the factor \(1/(1-\phi _k^TM^{-1}\phi _k)\). Accordingly, we have the sum of squares of the validation errors
Then the PRESS (3.80) is obtained by minimizing (3.156) with respect to \(\eta \in \varGamma \).
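The leave-one-out shortcut just derived can be checked numerically. The following Python sketch assumes, for illustration, a ridge-type regularizer \(\gamma \Vert \theta \Vert _2^2\) (so that \(M=\varPhi ^T\varPhi +\gamma I_n\), a special case of the ReLSQ setting) and compares the scaled residuals with explicit refitting:

```python
import numpy as np

rng = np.random.default_rng(5)
N, n, gamma = 20, 3, 0.5
Phi = rng.standard_normal((N, n))
Y = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

# Fit once on all data; H is the hat matrix Phi M^{-1} Phi^T.
M = Phi.T @ Phi + gamma * np.eye(n)
theta_hat = np.linalg.solve(M, Phi.T @ Y)
H = Phi @ np.linalg.solve(M, Phi.T)
r = Y - Phi @ theta_hat
r_loo_fast = r / (1.0 - np.diag(H))     # PRESS residuals via the shortcut

# Explicit leave-one-out refits for comparison.
r_loo_slow = np.empty(N)
for k in range(N):
    mask = np.arange(N) != k
    Pk, Yk = Phi[mask], Y[mask]
    th = np.linalg.solve(Pk.T @ Pk + gamma * np.eye(n), Pk.T @ Yk)
    r_loo_slow[k] = Y[k] - Phi[k] @ th

print(np.allclose(r_loo_fast, r_loo_slow))   # True: one fit suffices
```

The shortcut computes all N validation errors from a single fit, instead of N separate refits.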
3.8.4 Proof of Theorem 3.7
Using (3.92) and (3.100), it is easy to see that proving (3.90) is equivalent to showing that
and to prove the above inequality we need the following lemma.
Lemma 3.3
Consider the following additive measurement model:
where \(\mu \) is an unknown constant vector and \(\varepsilon \) is a random variable with zero mean and covariance matrix \({\mathscr {E}}(\varepsilon \varepsilon ^T)=\varSigma \). Let \(\hat{\mu }(x)\) be an estimator of \(\mu \) based on the data x and let \(\tilde{x}\) be new data generated from
where \(\tilde{\varepsilon }\) is a random variable uncorrelated with \(\varepsilon \), with zero mean and covariance matrix \({\mathscr {E}}(\tilde{\varepsilon }\tilde{\varepsilon }^T)=\varSigma \). Then it holds that
where the expectation is over both \(\varepsilon \) and \(\tilde{\varepsilon }\).
Proof
Firstly, we consider (3.160). We have
which shows that (3.160) is true.
Secondly, we consider (3.161). Similarly, we have
which implies that (3.161) is true. \(\square \)
Now we prove (3.157) by applying Lemma 3.3. Let
and then it follows from (3.161) that
Next we show that the right-hand side of (3.163) is nonnegative. For the ReLSQ problem (3.58a) with the ReLSQ estimate (3.58b), the predicted output \(\hat{Y}(\eta )\) of Y is
Then we have
where H is the hat matrix defined in (3.63). One has
Therefore, the right-hand side of (3.163) is nonnegative and thus (3.90) holds true, completing the proof of Theorem 3.7.
3.8.5 A Variant of the Expected In-Sample Validation Error and Its Unbiased Estimator
It is possible to derive variants of the expected in-sample validation error and its unbiased estimator by modifying (3.92) and (3.100).
Assume that \(\varPhi \) is full rank, i.e., \({{\,\mathrm{rank}\,}}(\varPhi )=n\). Then, multiplying both sides of (3.92) and (3.100) with \((\varPhi ^T\varPhi )^{-1}\varPhi ^T\) yields
which will be our new “true system” and new “validation data”, respectively.
Different from (3.162), we now take
Note that \(\hat{\theta }^{\text {LS}}=(\varPhi ^T\varPhi )^{-1}\varPhi ^TY\) and then it follows from (3.160) and (3.161) that
From the above two equations, we have
Further note that
then we have
Note that \({\mathscr {E}}(\Vert \hat{\theta }^{\text {R}}(\eta )\theta _0\Vert _2^2)\) is equal to \({{\,\mathrm{trace}\,}}(\text {MSE}({\hat{\theta }^{\text {R}}(\eta )},\theta _0))\), then we denote it by \(\text {mse}_{\eta }\) and we readily obtain an unbiased estimator of \(\text {mse}_{\eta }\) as follows:
Now given the training data (3.84), the corresponding estimate \(\widehat{\text {mse}_{\eta }} \) of \(\text {mse}_{\eta }\) can be used to estimate the hyperparameter \(\eta \): we should take the value of \(\eta \in \varGamma \) that minimizes (3.169), i.e.,
The criterion (3.170) is known as the SURE of the expected in-sample validation error for the true system (3.165) and the validation data (3.166), e.g., [33, 40].
Notes
 1.
Recall that the column rank (resp., the row rank) of a matrix is the dimension of the space spanned by the columns (resp., the rows) of the matrix. It is a fundamental result in linear algebra that the column rank and the row rank of a matrix are always equal and this number is called the rank of the matrix. A matrix is said to be full rank if its rank is equal to the lesser of the number of rows and columns and a matrix is said to be rank deficient otherwise.
References
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control AC–19:716–723
Allen DM (1974) The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16(1):125–127
Andersen MS, Dahl J, Vandenberghe L (2012) CVXOPT: a Python package for convex optimization, version 1.1.5. http://abel.ee.ucla.edu/cvxopt
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Candès EJ, Recht B (2009) Exact matrix completion via convex optimization. Found Comput Math 9(6):717
Candès EJ, Tao T (2010) The power of convex relaxation: near-optimal matrix completion. IEEE Trans Inf Theory 56(5):2053–2080
Chen T (2018) On kernel design for regularized LTI system identification. Automatica 90:109–122
Chen T, Ljung L (2013) Implementation of algorithms for tuning parameters in regularized least squares problems in system identification. Automatica 49:2213–2220
Chen T, Ohlsson H, Ljung L (2012) On the estimation of transfer functions, regularizations and Gaussian processes - revisited. Automatica 48:1525–1535
Demmel JW (1997) Applied numerical linear algebra. SIAM, Philadelphia
Diamond S, Boyd S (2016) CVXPY: a Python-embedded modeling language for convex optimization. J Mach Learn Res 17:1–5
Draper NR, Smith H (1981) Applied regression analysis, 2nd edn. Wiley, New York
Fazel M (2002) Matrix rank minimization with applications. PhD thesis, Department of Electrical Engineering, Stanford University
Fazel M, Hindi H, Boyd SP (2001) A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the 2001 American control conference, pp 4734–4739
Fazel M, Hindi H, Boyd SP (2003) Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In: Proceedings of the 2003 American control conference, vol 3, pp 2156–2162
Friedman J, Hastie T, Tibshirani R (2008) Sparse inverse covariance estimation with the graphical Lasso. Biostatistics 9(3):432–441
Golub GH, Van Loan CF (2013) Matrix computations, 4th edn. The Johns Hopkins University Press, Baltimore
Golub GH, Heath M, Wahba G (1979) Generalized crossvalidation as a method for choosing a good ridge parameter. Technometrics 21(2):215–223
Grant M, Boyd S, Ye Y (2009) MATLAB software for disciplined convex programming
Grenander U, Szegö G (1956) Toeplitz forms and their applications, vol 321. University of California Press
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer, Berlin
Hiriart-Urruty JB, Lemaréchal C (1993) Convex analysis and minimization algorithms II: advanced theory and bundle methods
Huber PJ (1981) Robust statistics. Wiley, New York
Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th international joint conference on artificial intelligence, San Francisco, CA, USA, pp 1137–1143
Ljung L (1999) System identification: theory for the user, 2nd edn. Prentice-Hall, Upper Saddle River
Lofberg J (2004) YALMIP: a toolbox for modeling and optimization in MATLAB. In: 2004 IEEE international symposium on computer aided control systems design, pp 284–289
Mallows CL (1973) Some comments on CP. Technometrics 15(4):661–675
Meinshausen N, Buhlmann P (2006) Highdimensional graphs and variable selection with the Lasso. Ann Stat 34(3):1436–1462
Mu B, Chen T, Ljung L (2018) On asymptotic properties of hyperparameter estimators for kernel-based regularization methods. Automatica 94:381–395
Mu B, Chen T, Ljung L (2018) Asymptotic properties of hyperparameter estimators by using cross-validations for regularized system identification. In: Proceedings of the 57th IEEE conference on decision and control, pp 644–649
Natarajan BK (1995) Sparse approximate solutions to linear systems. SIAM J Comput 24(2):227–234
Park T, Casella G (2008) The Bayesian Lasso. J Am Stat Assoc 103(482):681–686
Pillonetto G, Chiuso A (2015) Tuning complexity in regularized kernel-based regression and linear system identification: the robustness of the marginal likelihood estimator. Automatica 58:106–117
Rao BD, Engan K, Cotter SF, Palmer J, KreutzDelgado K (2003) Subset selection in noise based on diversity measure minimization. IEEE Trans Signal Process 51(3):760–770
Rao CR (1973) Linear statistical inference and its applications. Wiley, New York
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. The MIT Press, Cambridge
Recht B, Fazel M, Parrilo P (2010) Guaranteed minimumrank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev 52(3):471–501
Richard E, Savalle P, Vayatis N (2012) Estimation of simultaneously sparse and low rank matrices. In: The 29th international conference on machine learning (ICML)
Rissanen J (1978) Modelling by shortest data description. Automatica 14:465–471
Stein C (1981) Estimation of the mean of a multivariate normal distribution. Ann Stat 9:1135–1151
Stone M (1974) Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Ser B Stat Methodol 111–147
Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B Stat Methodol 58:267–288
Tibshirani R, Taylor J (2012) Degrees of freedom in Lasso problems. Ann Stat 40(2):1198–1232
Toeplitz O (1911) Zur theorie der quadratischen und bilinearen formen von unendlichvielen veränderlichen. Math Ann 70(3):351–376
Vapnik V (1998) Statistical learning theory. Wiley, New York
Wahba G (1990) Spline models for observational data. SIAM, Philadelphia
Wahba G (1999) Support vector machines, reproducing kernel Hilbert spaces, and the randomized GACV. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods: support vector learning. MIT Press, Cambridge, pp 69–88
Wolter KM (2007) Introduction to variance estimation, 2nd edn. Springer, Berlin
Woodbury MA (1950) Inverting modified matrices. Memorandum Rept. 42. Princeton University, Princeton
Zou H (2006) The adaptive Lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Zou H, Hastie T, Tibshirani R (2007) On the degrees of freedom of the Lasso. Ann Stat 35(5):2173–2192
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
Pillonetto, G., Chen, T., Chiuso, A., De Nicolao, G., Ljung, L. (2022). Regularization of Linear Regression Models. In: Regularized System Identification. Communications and Control Engineering. Springer, Cham. https://doi.org/10.1007/9783030958602_3
Print ISBN: 9783030958596
Online ISBN: 9783030958602