3.1 Linear Regression

Regression theory is concerned with modelling relationships among variables. It is used for predicting one dependent variable based on the information provided by one or more independent variables. In linear regression, the relationship among variables is given by linear functions. To illustrate this, we start from the function estimation problem because it is intuitive and easy to understand.

The aim of function estimation is to reconstruct a function \(g:{\mathbb R}^n\rightarrow {\mathbb R}\) with \(n\in \mathbb N\) from a collection of N measured values of g(x) and x which we denote, respectively, by \(y_i\) and \(x_i\) for \(i=1,\ldots , N\). For generic values of x, the estimate \(\hat{g}\) should give a good prediction \(\hat{g}(x)\) of g(x). The variables x and g(x) are often called the input and the output variable or simply the input and the output, respectively. The collection of measured values of x and g(x), given by the couples \(\{x_i,y_i\}\), is called the data set or, equivalently, the training set. In practical applications, the measurement \(y_i\) is often not precise and subject to some disturbance, i.e., for a given input \(x_i\) there is often a discrepancy between \(g(x_i)\) and its measured value \(y_i\). To describe this phenomenon, it is natural to introduce a disturbance variable \(e\in {\mathbb R}\) and assume that, for any given \(x\in {\mathbb R}^n\), the measured value of g(x) is

$$\begin{aligned} y=g(x)+e. \end{aligned}$$
(3.1)

Hence, y is the measured output and g(x) is the noise-free or true output. Accordingly, the training data \(\{x_i,y_i\}_{i=1}^N\) are collected as  follows:

$$\begin{aligned} y_i=g(x_i)+e_i, \quad i=1,\ldots , N. \end{aligned}$$
(3.2)

We are interested in linear regression models for estimation of g. For illustration, an example is now introduced.

Example 3.1

(Polynomial regression) We consider \(g:[0,1]\rightarrow {\mathbb R}\) and assume that this function is smooth. Then, g can be well approximated by a polynomial of a certain order. In this case, a linear regression model for the function estimation problem takes the following form:

$$\begin{aligned} y_i=\theta _1 + \sum _{k=2}^{n} \theta _{k} x_i^{k-1}+e_i, \quad i=1,\ldots , N, \end{aligned}$$
(3.3)

where \(\theta _k\in {\mathbb R}\) for \(k=1,\ldots ,n\). Defining

$$\begin{aligned} \phi (x_i) = [\begin{array}{cccc} 1&x_i&\dots&x_i^{n-1} \end{array}]^T,\quad \theta = [\begin{array}{cccc} \theta _1&\theta _2&\dots&\theta _n \end{array}]^T, \end{aligned}$$
(3.4)

where, for a real-valued matrix A, the notation \(A^T\) denotes its matrix transpose, we rewrite (3.3) as

$$\begin{aligned} y_i=\phi (x_i)^T \theta +e_i, \quad i=1,\ldots , N \end{aligned}$$
(3.5)

obtaining a more compact expression.    \(\square \)
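
The construction in Example 3.1 can also be sketched in code. The snippet below is only illustrative (NumPy is assumed, and the numerical values of \(\theta \), \(x_i\) and \(e_i\) are invented for the example): it builds the regressor (3.4) and evaluates the compact model (3.5) for one data point.

```python
import numpy as np

def poly_regressor(x, n):
    """Polynomial regressor phi(x) = [1, x, ..., x^(n-1)]^T as in (3.4)."""
    return np.array([x ** k for k in range(n)])

# Illustrative values only: a model with n = 3 parameters and one data point
theta = np.array([0.5, -1.0, 2.0])            # theta_1, theta_2, theta_3
x_i, e_i = 0.3, 0.01                          # one input and one noise sample
y_i = poly_regressor(x_i, 3) @ theta + e_i    # compact model (3.5)
```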

Although (3.5) is derived from Example 3.1, it is the general linear regression model studied in the theory of regression. For convenience, when the context is clear we suppress the explicit dependence of \(\phi (x_i)\) on \(x_i\) and simply write \(\phi _i\) in place of \(\phi (x_i)\). In addition, all vectors are column vectors. Then, model (3.5) becomes

$$\begin{aligned} y_i=\phi _i^T \theta +e_i, \quad i=1,\ldots , N,\quad y_i\in {\mathbb R}, \ \phi _i\in {\mathbb R}^n, \ \theta \in {\mathbb R}^n, \ e_i\in {\mathbb R}. \end{aligned}$$
(3.6)

In what follows, we will focus on (3.6) and introduce the linear regression problem, the methods of least squares and regularized least squares. We will call \(y_i\in {\mathbb R}\) the measured output, \(\phi _i\in {\mathbb R}^n\) the regressor, \(\theta \in {\mathbb R}^n\) the model parameter, n the model order, and \(e_i\) the measurement noise.

Before proceeding, it should be noted that the choice of the model order n is a critical problem in practical applications. The rule of thumb is to set n to a large enough value such that g can be represented by the proposed model structure. In system identification, this corresponds to introducing a model structure flexible enough to contain the true system. Consider, e.g., Example 3.1 again and assume that the function g is actually a polynomial of order 5. Clearly, if the dimension of \(\theta \) does not satisfy \(n \ge 6\), then \(x^5\) cannot be represented and some model bias will affect the estimation process. However, the order n should not be chosen larger than necessary, because this can increase the variance of the model estimate. This problem is actually the same as model complexity selection in classical system identification and is connected with the bias-variance trade-off illustrated in the first two chapters and discussed in more detail shortly.

Also in light of the above discussion, we often assume that the model order n is either large enough for g to be adequately represented by the proposed model or even that a true model parameter that has generated the data exists, denoted by \(\theta _0\in {\mathbb R}^n\). Hence, we can formulate linear regression as the problem of obtaining an estimate \(\hat{\theta }\) such that, given a new regressor \(\phi \in {\mathbb R}^n\), the prediction \(\phi ^T \hat{\theta }\) is close to \(\phi ^T \theta _0\).

3.2 The Least Squares Method

There are many methods to estimate \(\theta \) in the linear regression model (3.6). In this section, we consider the least squares (LS) method.

3.2.1 Fundamentals of the Least Squares Method

Given the data \(y_i,\phi _i\) for \(i=1,\ldots ,N\), one way to estimate \(\theta \) is to minimize the least squares (LS) criterion:

$$\begin{aligned} \hat{\theta }^{\text {LS}} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \ l(\theta ),\qquad l(\theta ) =\sum _{i=1}^N (y_i-\phi _i^T\theta )^2, \end{aligned}$$
(3.7)

where \(l(\theta )\) is the LS criterion and \(\hat{\theta }^{\text {LS}}\) is the LS estimate of \(\theta \). Then, given a new regressor \(\phi \in {\mathbb R}^n\), the predicted output \(\hat{y}\) of \(\phi ^T \theta _0\) is obtained as

$$\begin{aligned} \hat{y} = \phi ^T \hat{\theta }^{\text {LS}}. \end{aligned}$$
(3.8)

3.2.1.1 Normal Equations and LS Estimate

The LS estimate \(\hat{ \theta }^{\text {LS}}\) given by (3.7) has a closed-form expression. To see this, note that the first- and second-order derivatives of \(l(\theta )\) with respect to \(\theta \) are

$$\begin{aligned} \frac{\partial l(\theta )}{\partial \theta } =2 \sum _{i=1}^N \phi _i\phi _i^T\theta -2\sum _{i=1}^N \phi _iy_i, \quad \frac{\partial ^2 l(\theta )}{\partial \theta \partial \theta ^T} = 2\sum _{i=1}^N \phi _i\phi _i^T\succcurlyeq 0, \end{aligned}$$
(3.9)

where \(A\succcurlyeq 0\) means that A is a positive semidefinite matrix. Then all \(\hat{\theta }^{\text {LS}}\) that satisfy

$$\begin{aligned} \left[ \sum _{i=1}^N \phi _i\phi _i^T\right] \hat{\theta }^{\text {LS}} =\sum _{i=1}^N \phi _iy_i \end{aligned}$$
(3.10)

are global minima of \(l(\theta )\). The set of Eqs. (3.10) is known as the normal equations. For the time being, we assume that \(\sum _{i=1}^N \phi _i\phi _i^T\) is full rank. Then

$$\begin{aligned} \hat{\theta }^{\text {LS}} =\left[ \sum _{i=1}^N \phi _i\phi _i^T\right] ^{-1}\sum _{i=1}^N \phi _iy_i. \end{aligned}$$
(3.11)

3.2.1.2 Matrix Formulation

It is often convenient to rewrite the LS method in matrix form. To this goal, let

$$\begin{aligned} Y=\left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \\ \end{array} \right] ,\ \varPhi =\left[ \begin{array}{c} \phi _1^T \\ \phi _2^T \\ \vdots \\ \phi _N^T \\ \end{array} \right] ,\ E=\left[ \begin{array}{c} e_1 \\ e_2 \\ \vdots \\ e_N \\ \end{array} \right] . \end{aligned}$$
(3.12)

We can then rewrite (3.6) with the \(\theta _0\) that generated the data, the LS criterion (3.7), the normal Eqs. (3.10) and the LS estimate (3.11) in matrix form, respectively:

$$\begin{aligned} Y&= \varPhi \theta _0+E \end{aligned}$$
(3.13)
$$\begin{aligned} \hat{\theta }^{\text {LS}}&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \ l(\theta ),\qquad l(\theta )=\Vert Y-\varPhi \theta \Vert _2^2 \end{aligned}$$
(3.14)
$$\begin{aligned} \varPhi ^T\varPhi \hat{\theta }^{\text {LS}}&=\varPhi ^TY \end{aligned}$$
(3.15)
$$\begin{aligned} \hat{\theta }^{\text {LS}}&=(\varPhi ^T\varPhi )^{-1}\varPhi ^TY, \end{aligned}$$
(3.16)

where \(\Vert \cdot \Vert _2\) is the Euclidean norm, i.e., the 2-norm, and \(\varPhi \) is called the  regression matrix.
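
As a hedged illustration of the matrix formulation, the sketch below builds \(Y\) and \(\varPhi \) for a simple polynomial problem and computes (3.16); NumPy is assumed and all numerical values are invented for the example. Note that np.linalg.lstsq solves the LS problem without forming \((\varPhi ^T\varPhi )^{-1}\) explicitly, which is numerically preferable.

```python
import numpy as np

# Invented data: N = 40 noisy samples of a quadratic, regressors as in (3.4)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
Phi = np.vander(x, 3, increasing=True)        # columns 1, x, x^2 (regression matrix)
theta0 = np.array([1.0, -2.0, 0.5])
Y = Phi @ theta0 + 0.05 * rng.standard_normal(40)

# LS estimate (3.16)
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
print(theta_ls)                               # close to theta0
```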

3.2.2 Mean Squared Error and Model Order Selection

3.2.2.1 Bias, Variance, and Mean Squared Error of the LS Estimate

We study the linear regression problem in a probabilistic framework, assuming that data are generated according to (3.13) and that

$$\begin{aligned} \text {the measurement noises}\, e_i,\, \text {for }\, i=1,\ldots ,N, \text {are i.i.d. with mean}\, 0\, \text {and variance}\, \sigma ^2. \end{aligned}$$
(3.17)

Due to this assumption, the LS estimator \(\hat{\theta }^{\text {LS}}\), as well as any other estimator of \(\theta \) that depends on the data, becomes a random variable. Then, it is interesting to study the statistical properties of \(\hat{\theta }^{\text {LS}}\), such as the bias, variance and mean squared error (MSE).

All the expectations reported below are computed with respect to the noises \(e_i\) with the regressors \(\phi _i\) assumed to be deterministic. Simple calculations lead to

$$\begin{aligned} {\mathscr {E}}(\hat{\theta }^{\text {LS}})&= \theta _0\end{aligned}$$
(3.18a)
$$\begin{aligned} \hat{\theta }_{\text {bias}}^{\text {LS}}&={\mathscr {E}}(\hat{\theta }^{\text {LS}})-\theta _0= 0 \end{aligned}$$
(3.18b)
$$\begin{aligned} \text {Cov}(\hat{\theta }^{\text {LS}},\hat{\theta }^{\text {LS}})&={\mathscr {E}}[(\hat{\theta }^{\text {LS}}-{\mathscr {E}}( \hat{\theta }^{\text {LS}}))(\hat{\theta }^{\text {LS}}-{\mathscr {E}}( \hat{\theta }^{\text {LS}}))^T] =\sigma ^2(\varPhi ^T\varPhi )^{-1}\end{aligned}$$
(3.18c)
$$\begin{aligned} \text {MSE}(\hat{\theta }^{\text {LS}}, \theta _0)&={\mathscr {E}}[(\hat{\theta }^{\text {LS}}-\theta _0) (\hat{\theta }^{\text {LS}}-\theta _0)^T]\nonumber \\&=\text {Cov}(\hat{\theta }^{\text {LS}},\hat{\theta }^{\text {LS}})+\hat{\theta }_{\text {bias}}^{\text {LS}}(\hat{\theta }_{\text {bias}}^{\text {LS}})^T\nonumber \\&= \sigma ^2(\varPhi ^T\varPhi )^{-1}, \end{aligned}$$
(3.18d)

where \(\text {Cov}(\hat{\theta }^{\text {LS}},\hat{\theta }^{\text {LS}})\) is the covariance matrix of \(\hat{\theta }^{\text {LS}}\) and \(\text {MSE}(\hat{\theta }^{\text {LS}}, \theta _0)\) is the MSE matrix of \(\hat{\theta }^{\text {LS}}\), a function of the true model parameter \(\theta _0\).
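
A quick Monte Carlo experiment can be used to check (3.18) numerically; the sketch below (NumPy assumed, all sizes and values illustrative) repeatedly regenerates the noise, recomputes the LS estimate and compares the empirical mean and covariance with \(\theta _0\) and \(\sigma ^2(\varPhi ^T\varPhi )^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, sigma = 40, 3, 0.1
Phi = np.vander(np.linspace(0, 1, N), n, increasing=True)   # deterministic regressors
theta0 = np.array([1.0, -2.0, 0.5])

estimates = []
for _ in range(20000):                                      # Monte Carlo runs
    Y = Phi @ theta0 + sigma * rng.standard_normal(N)
    estimates.append(np.linalg.lstsq(Phi, Y, rcond=None)[0])
estimates = np.array(estimates)

print(estimates.mean(axis=0))                   # approx theta0: zero bias, (3.18a)-(3.18b)
print(np.cov(estimates.T))                      # approx sigma^2 (Phi^T Phi)^{-1}, (3.18c)
print(sigma ** 2 * np.linalg.inv(Phi.T @ Phi))
```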

3.2.2.2 Model Order Selection

The issue of model order selection is essentially the same as that of model complexity selection in the classical system identification scenario. Therefore, the techniques introduced in Sect. 2.4.3 can be used to choose the model order n, e.g., Akaike’s information criterion (AIC) [1], the Bayesian Information criterion (BIC) or Minimum Description Length (MDL) approach [25, 39].

The quality of the LS estimate \(\hat{\theta }^{\text {LS}}\) depends on the adopted model order n. In practical applications, model complexity is in general unknown and needs to be determined from data. As the model order n gets larger, the fit to the data \(\Vert Y-\varPhi \hat{\theta }^{\text {LS}}\Vert _2^2\) in (3.14) will become smaller, but the variances along the diagonal of the MSE matrix (3.18d) of \(\hat{\theta }^{\text {LS}}\) will become larger at the same time. When assessing the quality of \(\hat{\theta }^{\text {LS}}\), one way to account for the increasing variance is to introduce criteria that suitably modify the plain data fit. AIC and BIC are techniques following this idea and can be used for model order selection. More specifically, besides (3.17), further assuming that the errors are independent and  Gaussian, i.e.,

$$\begin{aligned} e_i\sim \mathscr {N} (0,\sigma ^2), \quad i=1,\ldots ,N \end{aligned}$$
(3.19)

with known noise variance \(\sigma ^2\), we obtain

$$\begin{aligned}&\text {AIC:}\qquad \hat{\theta }^{\text {LS}} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\theta \in {\mathbb R}^n} \frac{1}{N}\Vert Y-\varPhi \theta \Vert _2^2 + 2\sigma ^2\frac{n}{N},\end{aligned}$$
(3.20)
$$\begin{aligned}&\text {BIC or MDL:}\qquad \hat{\theta }^{\text {LS}} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\theta \in {\mathbb R}^n} \frac{1}{N}\Vert Y-\varPhi \theta \Vert _2^2 + \log (N)\sigma ^2\frac{n}{N}, \end{aligned}$$
(3.21)

where the minimization also takes place over a family of model structures with different dimension n of \(\theta \).

Another way is to estimate the prediction capability of the model on some unseen data which are not used for model estimation. As briefly seen in Sect. 2.6.3, cross-validation (CV) exploits this idea and is among the most widely used techniques for model selection. Recall that hold out CV is the simplest form of CV with data divided into two parts. One part is used to estimate the model with different model orders and the other part is used to assess the prediction capability of each model through the prediction score \(\Vert Y_{\text {v}}-\varPhi _{\text {v}}\hat{\theta }^{\text {LS}}\Vert _2^2\). Here, \(Y_\text {v},\varPhi _\text {v}\) are the validation data which are different from those used to derive \(\hat{\theta }^{\text {LS}}\). The model order giving the best prediction score will be chosen.
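
A minimal sketch of hold out CV for selecting the polynomial order is given below (NumPy assumed; the split into odd-indexed estimation samples and even-indexed validation samples mirrors the one used in Example 3.2 below, and the helper name is ours).

```python
import numpy as np

def holdout_cv_order(x, y, orders):
    """Hold out CV: one half of the data estimates the model for each candidate order,
    the other half scores it via ||Y_v - Phi_v theta_hat||_2^2; the best order is returned."""
    scores = {}
    for n in orders:
        Phi_t = np.vander(x[0::2], n, increasing=True)   # estimation regressors
        Phi_v = np.vander(x[1::2], n, increasing=True)   # validation regressors
        theta_hat = np.linalg.lstsq(Phi_t, y[0::2], rcond=None)[0]
        scores[n] = np.sum((y[1::2] - Phi_v @ theta_hat) ** 2)
    return min(scores, key=scores.get), scores
```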

The noise variance \(\sigma ^2\) of the measurement noises \(e_i\) plays an important role in statistical modelling, e.g., in the assessment of the variance of \(\hat{\theta }^{\text {LS}}\) and in the model order selection using, e.g., AIC (3.20) or BIC (3.21). In practical applications, the noise variance \(\sigma ^2\) is in general unknown and needs to be estimated from the data Y and \(\varPhi \). It can be estimated in different ways based on the maximum likelihood estimation (MLE) method or the statistical property  of \(\hat{\theta }^{\text {LS}}\).

Under (3.17) and the Gaussian assumption (3.19), the ML estimate of \(\sigma ^2\), as given in [25, p. 506], is

$$\begin{aligned} \hat{\sigma }^{2,\text {ML}} = \frac{1}{N}\Vert Y-\varPhi \hat{\theta }^{\text {LS}}\Vert _2^2. \end{aligned}$$
(3.22)

Using only assumption (3.17), an unbiased estimator of \(\sigma ^2\), as given in [25, p. 554], turns out to be

$$\begin{aligned} \hat{\sigma }^{2} = \frac{1}{N-n}\Vert Y-\varPhi \hat{\theta }^{\text {LS}}\Vert _2^2. \end{aligned}$$
(3.23)

AIC and BIC were reported, respectively, in (3.20) and (3.21) assuming known noise variance. When \(\sigma ^2\) is unknown, the use of the ML estimate (3.22) leads to the widely used AIC and BIC for Gaussian innovations, e.g., [25, pp. 506–507]:

$$\begin{aligned}&\text {AIC:}\qquad \hat{\theta }^{\text {LS}} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\theta \in {\mathbb R}^n} \ \log \left( \frac{1}{N}\Vert Y-\varPhi \theta \Vert _2^2\right) + 2\frac{n}{N},\end{aligned}$$
(3.24)
$$\begin{aligned}&\text {BIC or MDL:}\qquad \hat{\theta }^{\text {LS}} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\theta \in {\mathbb R}^n} \ \log \left( \frac{1}{N}\Vert Y-\varPhi \theta \Vert _2^2\right) + \log (N)\frac{n}{N}. \end{aligned}$$
(3.25)
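
The two criteria can be implemented in a few lines; the sketch below (NumPy assumed, helper name ours) evaluates (3.24) and (3.25) over a family of polynomial model structures and returns the orders they select.

```python
import numpy as np

def aic_bic_order(x, y, orders):
    """AIC (3.24) and BIC/MDL (3.25) scores over candidate polynomial orders n."""
    N = len(y)
    aic, bic = {}, {}
    for n in orders:
        Phi = np.vander(x, n, increasing=True)
        theta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]
        fit = np.sum((y - Phi @ theta_hat) ** 2) / N         # (1/N)||Y - Phi theta||^2
        aic[n] = np.log(fit) + 2 * n / N
        bic[n] = np.log(fit) + np.log(N) * n / N
    return min(aic, key=aic.get), min(bic, key=bic.get)
```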

Example 3.2

(Polynomial regression using LS and discrete model order selection) We apply the LS method and the model order selection techniques to polynomial regression as sketched in Example 3.1. Let the function g be

$$\begin{aligned} g(x)=\sin ^2(x)(1-x^2), \quad x\in [0,1]. \end{aligned}$$
(3.26)

Then, we generate the data as follows:

$$\begin{aligned} y_i = \sin ^2(x_i)(1-x_i^2) + e_i, \quad i=1,\ldots ,40, \end{aligned}$$
(3.27)

where \(x_1=0,x_{40}=1\), the \(x_2,\ldots ,x_{39}\) are evenly spaced points between \(x_1\) and \(x_{40}\), and the noises \(e_i\) are i.i.d. Gaussian distributed with zero mean and standard deviation 0.034. The function g and the generated data are shown in Fig. 3.1.

Fig. 3.1 Polynomial regression: the function g(x) (blue curve) and the data \(\{x_i,y_i\}_{i=1}^{40}\) (red circles)

The function g is smooth and can be well approximated by polynomials. However, it is unclear which order should be chosen. Hence, we test the values \(n=1,\ldots ,15\) and, for each order n, we form the regressor (3.4), the linear regression model (3.13) and derive the LS estimate \(\hat{\theta }^{\text {LS}}\). As shown in Fig. 3.2, as the order n increases the data fit \(\Vert Y-\varPhi \hat{\theta }^{\text {LS}}\Vert _2^2\) keeps decreasing.

Fig. 3.2 Polynomial regression: profile of the LS data fit as a function of the discrete model order n

For model order selection, we use AIC (3.24), BIC (3.25) and hold out CV with \(x_i,y_i\), \(i=1,3,\ldots ,39\) for estimation and \(x_i,y_i\), \(i=2,4,\ldots ,40\) for validation. Figure 3.3 plots the values of AIC (3.24), BIC (3.25) and the prediction score of hold out CV. The order n selected by AIC and BIC is the same and equal to 3, while that selected by hold out CV is 7.

Fig. 3.3 Polynomial regression: model order selection with \(n=1,\ldots ,15\) using LS. The blue curve, the red curve and the yellow curve show the values of AIC (3.24), BIC (3.25) and the prediction score of hold out CV, respectively

Fig. 3.4 Polynomial regression: profile of the model fit (3.28) as a function of the order n using LS. The most accurate estimate is obtained with model order equal to 3 which corresponds to a second-order polynomial

Fig. 3.5 Polynomial regression: true function (blue line) and LS estimates obtained using three different model orders given by \(n=3,7\) and 15

To evaluate the performance of models of different complexity, we compute the fit measure

$$\begin{aligned} \mathscr {F}=100\left( 1 - \left[ \frac{\sum ^{40}_{k=1}|g(x_k)-\hat{g}(x_k)|^2 }{\sum ^{40}_{k=1}|g(x_k)-\bar{g}^0|^2}\right] ^{1/2}\right) ,\quad \bar{g}^0=\frac{1}{40}\sum ^{40}_{k=1}g(x_k). \end{aligned}$$
(3.28)

Note that \(\mathscr {F}=100\) means a perfect agreement between g(x) and the corresponding estimate. The model fits for \(n=1,\ldots ,15\) are shown in Fig. 3.4: the order \(n=3\) gives the best prediction. Figure 3.5 plots the estimates of g(x) for \(n=3,7,15\) over the \(x_i\), \(i=1,\ldots ,40\). Overfitting occurs when \(n=15\), indicating that the corresponding model is too flexible and fooled by the noise.    \(\square \)

3.3 Ill-Conditioning

3.3.1 Ill-Conditioned Least Squares Problems

When \(\varPhi \in {\mathbb R}^{N\times n}\) with \(N\ge n\) is rank deficient, i.e., \({{\,\mathrm{rank}\,}}( \varPhi ) < n\), or “close” to rank deficient, the corresponding LS problem is said to be ill-conditioned. Examples were already encountered in Sect. 1.1.2 to discuss some limitations of the James–Stein estimators and in Sect. 1.2 in the context of FIR models. There are different ways to handle ill-conditioned LS problems. Below, we show how to calculate \(\hat{\theta }^{\text {LS}}\) more accurately by using the singular value decomposition (SVD).

3.3.1.1 Singular Value Decomposition

SVD is a fundamental matrix decomposition. Any matrix \(\varPhi \in {\mathbb R}^{N\times n}\), with \(N\ge n\) to simplify the exposition, can be decomposed as follows:

$$\begin{aligned} \varPhi = U\varLambda V^T, \end{aligned}$$
(3.29)

where \(\varLambda \) is a rectangular diagonal matrix with nonnegative diagonal entries \(\sigma _i\), \(i=1,\ldots ,n\) and \(U\in {\mathbb R}^{N\times N}\) and \(V\in {\mathbb R}^{n\times n}\) are orthogonal matrices, i.e., such that \(U^TU=UU^T=I_N\) and \(V^TV=VV^T=I_n\). The factorization (3.29) is called the singular value decomposition of \(\varPhi \) and the \(\sigma _i\) are called the singular values of \(\varPhi \). Without loss of generality, they can be assumed to be ordered according to their magnitude:

$$ \sigma _1\ge \sigma _2\ge \dots \ge \sigma _n\ge 0. $$

Since \(\varPhi ^T \varPhi = V \varLambda ^T \varLambda V^T = V D^2 V^T\), where D is a square diagonal matrix whose diagonal entries are the \(\sigma _i\), it follows that

$$\begin{aligned} \sigma _i = \sqrt{\lambda _i(\varPhi ^T \varPhi )}, \quad i=1,\ldots ,n, \end{aligned}$$
(3.30)

where \(\lambda _i(A)\) denotes the ith eigenvalue of the matrix A.

3.3.1.2 Condition Number

The condition number of a matrix is a measure of how “close” the matrix is to being rank deficient. When \(\varPhi \) is an invertible square matrix, it is denoted by \({{\,\mathrm{cond}\,}}(\varPhi )\) below and defined as

$$\begin{aligned} {{\,\mathrm{cond}\,}}(\varPhi ) = \Vert \varPhi ^{-1}\Vert \Vert \varPhi \Vert , \end{aligned}$$
(3.31)

where \(\Vert \cdot \Vert \) is a matrix norm, with the convention that \({{\,\mathrm{cond}\,}}(\varPhi )=\infty \) for singular \(\varPhi \). For a generic \(\varPhi \in {\mathbb R}^{N\times n}\), with SVD in the form (3.29), its condition number with respect to the 2-norm \(\Vert \cdot \Vert _2\) is defined as

$$\begin{aligned} {{\,\mathrm{cond}\,}}(\varPhi ) = \frac{\sigma _{\text {max}}}{\sigma _\text {min}}, \end{aligned}$$
(3.32)

where \(\sigma _{\text {max}}=\sigma _1\) and \(\sigma _{\text {min}}=\sigma _n\) are the largest and smallest singular values of \(\varPhi \), respectively. If we use the 2-norm \(\Vert \cdot \Vert _2\) in (3.31), then (3.31) coincides with (3.32). Hereafter, the condition number of a matrix will be defined by (3.32).
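
In practice the condition number (3.32) is computed directly from the singular values; a small sketch follows (NumPy assumed), with the built-in np.linalg.cond shown only as a cross-check since it uses the same 2-norm definition.

```python
import numpy as np

def cond_2norm(Phi):
    """Condition number (3.32): ratio of the largest to the smallest singular value."""
    s = np.linalg.svd(Phi, compute_uv=False)     # singular values, in decreasing order
    return np.inf if s[-1] == 0 else s[0] / s[-1]

Phi = np.vander(np.linspace(0, 1, 40), 15, increasing=True)   # ill-conditioned example
print(cond_2norm(Phi), np.linalg.cond(Phi))                   # the two values agree
```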

3.3.1.3 Ill-Conditioned Matrix and LS Problem

The condition number of a matrix is important since it can be used to measure the sensitivity of the LS estimate to perturbations in the data. To be specific, let \(\varPhi \in {\mathbb R}^{N\times n}\) be full rank and let \(\delta Y\) denote a small componentwise perturbation in Y. The solution of the perturbed LS criterion becomes

$$\begin{aligned} \tilde{\theta }^{\text {LS}}_2 = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert (Y+\delta Y)-\varPhi \theta \Vert _2^2. \end{aligned}$$
(3.33)

Then, it can be shown, e.g., [17, Chap. 5], [10, Chap. 3], that

$$\begin{aligned} \frac{\Vert \tilde{\theta }^{\text {LS}}_2-\hat{\theta }^{\text {LS}}_2\Vert _2}{\Vert \hat{\theta }^{\text {LS}}_2\Vert _2}\le {{\,\mathrm{cond}\,}}(\varPhi ) \varepsilon +O\left( \varepsilon ^2 \right) , \quad \varepsilon =\frac{\Vert \delta Y \Vert _2}{ \Vert Y \Vert _2}. \end{aligned}$$
(3.34)

So, the relative error bound depends on \({{\,\mathrm{cond}\,}}(\varPhi )\): the larger \({{\,\mathrm{cond}\,}}(\varPhi )\), the larger the relative error. One can thus say that the matrix \(\varPhi \) (and the LS problem) with a small condition number is well conditioned, while the matrix \(\varPhi \) (and the LS problem) with a large condition number is ill-conditioned.  The condition number enters also more complex bounds on the relative error due to perturbations on the matrix \(\varPhi \) [10, 17].

Example 3.3

(Effect of ill-conditioning on LS) Consider the linear regression model (3.13). Let

$$\begin{aligned} \varPhi = \frac{1}{2}\left[ \begin{array}{cc} 1 &{} 1 \\ 1+10^{-8} &{} 1-10^{-8} \\ \end{array} \right] ,\quad Y=\left[ \begin{array}{c} 1 \\ 1 \\ \end{array} \right] . \end{aligned}$$
(3.35)

The two singular values of \(\varPhi \) are \(\sigma _{\text {max}}=1\) and \(\sigma _{\text {min}}=5\times 10^{-9}\), implying that \({{\,\mathrm{cond}\,}}(\varPhi )=2\times 10^{8}\). Thus, \(\varPhi \) and the LS problem (3.14) are ill-conditioned.

Using the normal Eq. (3.15), we obtain the LS estimate \(\hat{\theta }^{\text {LS}}_1\) in closed form:

$$\begin{aligned} \hat{\theta }^{\text {LS}}_1 = (\varPhi ^T\varPhi )^{-1}\varPhi ^TY=\varPhi ^{-1}Y = \left[ \begin{array}{c} 1 \\ 1 \\ \end{array} \right] . \end{aligned}$$
(3.36)

Now, suppose that there is a small perturbation \(\delta Y\) in Y

$$\begin{aligned} \delta Y=\left[ \begin{array}{c} 0.01 \\ 0 \\ \end{array} \right] . \end{aligned}$$
(3.37)

Solving the normal Eq. (3.15) with Y replaced by \(Y+\delta Y\) now gives

$$\begin{aligned} \hat{\theta }^{\text {LS}}_2=\left[ \begin{array}{c} 1.01-10^{6}\\ 1.01+10^{6} \end{array} \right] . \end{aligned}$$
(3.38)

So, when the LS problem (3.14) is ill-conditioned, a small perturbation in Y could cause a significant change in the LS estimate derived by solving the normal Eq. (3.15) directly.    \(\square \)
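
The sensitivity described in Example 3.3 is easy to reproduce numerically; the sketch below (NumPy assumed) solves the LS problem for the nominal and the perturbed output and shows how far apart the two estimates are.

```python
import numpy as np

# Data of Example 3.3: cond(Phi) is about 2e8, so the LS problem is ill-conditioned
Phi = 0.5 * np.array([[1.0, 1.0],
                      [1.0 + 1e-8, 1.0 - 1e-8]])
Y = np.array([1.0, 1.0])
dY = np.array([0.01, 0.0])

theta_1 = np.linalg.lstsq(Phi, Y, rcond=None)[0]        # close to [1, 1]
theta_2 = np.linalg.lstsq(Phi, Y + dY, rcond=None)[0]   # roughly [1.01 - 1e6, 1.01 + 1e6]
print(np.linalg.cond(Phi))
print(theta_1, theta_2)
```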

Example 3.4

(Polynomial regression: ill-conditioned LS problem) We revisit the polynomial regression example defined by (3.26) and (3.27), stressing the dependence of the condition number on the polynomial complexity. In particular, Fig. 3.6 shows that the ill-conditioning of the regression matrix \(\varPhi \) constructed according to (3.4) and (3.12) worsens as the dimension n increases. This further points out the importance of a careful selection of the discrete model order to control the estimator’s variance when using LS.    \(\square \)

Fig. 3.6 Polynomial regression: profile of the base 10 logarithm of the condition number of \(\varPhi \) as a function of the order n

3.3.1.4 LS Estimate Exploiting the SVD of \(\varPhi \)

In order to obtain more accurate LS estimates for ill-conditioned problems, one can use the SVD of \(\varPhi \). Given \(\varPhi \in {\mathbb R}^{N\times n}\) with \(N\ge n\), we consider two cases:

  • \(\varPhi \) is rank deficient, i.e., \({{\,\mathrm{rank}\,}}(\varPhi )<n\).

  • \(\varPhi \) is full rank but has a very large condition number, i.e., \({{\,\mathrm{rank}\,}}(\varPhi )=n\) but \({{\,\mathrm{cond}\,}}(\varPhi )\) is very large.

For the rank-deficient case, we assume without loss of generality that \({{\,\mathrm{rank}\,}}(\varPhi )=m<n\). In this case, the LS problem does not have a unique solution. To get a special solution, we have to impose extra conditions on the solutions of the LS problem.

Let the singular value decomposition of \(\varPhi \) be

$$\begin{aligned} \varPhi =U \varLambda V^T = \left[ \begin{array}{cc}U_1&U_2\end{array}\right] \left[ \begin{array}{cc} \varLambda _1 &{} 0 \\ 0 &{} 0 \end{array} \right] \left[ \begin{array}{cc}V_1&V_2\end{array}\right] ^T, \end{aligned}$$
(3.39)

where \(\varLambda _1\in {\mathbb R}^{m \times m}\) is diagonal and positive definite while \(U_1 \in {\mathbb R}^{N\times m}\) and \(V_1 \in {\mathbb R}^{n\times m}\).

We now perform a change of coordinates in both the output and parameter space

$$\begin{aligned} \tilde{Y} = U^T Y = \left[ \begin{array}{c} U_1^T Y \\ U_2^T Y \end{array}\right] =\left[ \begin{array}{c} \tilde{Y}_1 \\ \tilde{Y}_2 \end{array}\right] , \qquad \tilde{\theta }= V^T \theta =\left[ \begin{array}{c} V_1^T \theta \\ V_2^T \theta \end{array}\right] =\left[ \begin{array}{c} \tilde{\theta }_1 \\ \tilde{\theta }_2 \end{array}\right] . \end{aligned}$$

Note that both \(\tilde{Y}_1\) and \(\tilde{\theta }_1\) are m-dimensional vectors. In the new coordinates, the residual vector is

$$\begin{aligned} U^T \left( Y-\varPhi \theta \right) = \tilde{Y} - \varLambda \tilde{\theta }= \left[ \begin{array}{c} \tilde{Y}_1 - \varLambda _1 \tilde{\theta }_1 \\ \tilde{Y}_2 \end{array}\right] . \end{aligned}$$

The LS criterion can be rewritten as

$$\begin{aligned} \Vert Y - \varPhi \theta \Vert ^2 =(Y - \varPhi \theta )^T U U^T (Y - \varPhi \theta )=\Vert \tilde{Y} - \varLambda \tilde{\theta }\Vert ^2 = \Vert \tilde{Y}_1 - \varLambda _1 \tilde{\theta }_1 \Vert ^2 + \Vert \tilde{Y}_2 \Vert ^2 \end{aligned}$$

and is minimized by

$$\begin{aligned} \tilde{\theta }^{\text {LS}}= \left[ \begin{array}{c}\tilde{\theta }_1^{\text {LS}} \\ \tilde{\theta }_2^{\text {LS}} \end{array}\right] = \left[ \begin{array}{c}\varLambda _1^{-1} \tilde{Y}_1 \\ \tilde{\theta }_2\end{array}\right] , \end{aligned}$$
(3.40)

where \(\tilde{\theta }_2 \in {\mathbb R}^{n-m}\) is an arbitrary vector. To get the minimum norm solution, one can set \(\tilde{\theta }_2=0\) that, turning back to the original coordinates, yields

$$\begin{aligned} \hat{ \theta }^{\text {LS}} =V \tilde{\theta }^{\text {LS}} = V_1\varLambda _1^{-1}U_1^TY. \end{aligned}$$
(3.41)

Interestingly, for the rank-deficient case, the special solution (3.41) relates to the Moore–Penrose pseudoinverse of \(\varPhi \), defined as

$$\begin{aligned} \varPhi ^+= V \varLambda ^+ U^T = \left[ \begin{array}{cc}V_1&V_2\end{array}\right] \left[ \begin{array}{cc} \varLambda _1^{-1} &{} 0 \\ 0 &{} 0 \end{array} \right] \left[ \begin{array}{cc}U_1&U_2\end{array}\right] ^T= V_1 \varLambda _1^{-1} U_1^T. \end{aligned}$$

So, given a rectangular diagonal matrix \(\varLambda \), its pseudoinverse \(\varLambda ^+\) is obtained by replacing all the nonzero diagonal entries by their reciprocal and transposing the resulting matrix. When \({{\,\mathrm{rank}\,}}(\varPhi )=n\), the pseudoinverse returns the usual (unique) LS solution

$$\begin{aligned} \varPhi ^+=\left( \varPhi ^T \varPhi \right) ^{-1} \varPhi ^T. \end{aligned}$$

It follows that the minimum norm solution among the general solutions of the LS problem (3.14) can always be written as

$$\begin{aligned} \hat{\theta }^{\text {LS}}=\varPhi ^+Y. \end{aligned}$$

For the rank-deficient case, due to roundoff errors, \(\varPhi \) may have some very small computed singular values in addition to the m singular values contained in \(\varLambda _1\) in (3.39). The situation is similar to the case where \(\varPhi \) is full rank but with a very large condition number. Note also that the rank of \(\varPhi \) needs to be known beforehand to form the partitioned SVD (3.39) and use (3.41). However, numerical determination of the rank of a matrix is nontrivial (and outside the scope of this book). Here, we just mention a simple way to deal with these issues by using the so-called truncated SVD.

Consider the SVD (3.39) and, without loss of generality, assume

$$ \varLambda = {{\,\mathrm{diag}\,}}(\sigma _1,\sigma _2,\ldots ,\sigma _n) \ \ \text{ with } \ \ \sigma _1\ge \sigma _2\ge \dots \ge \sigma _n\ge 0. $$

Now set \(\hat{\sigma }_i=\sigma _i\) if \(\sigma _i>tol \) and \(\hat{\sigma }_i=0\) otherwise. Then

$$\begin{aligned} {\hat{\varPhi }}=U{\hat{\varLambda }} V^T, \end{aligned}$$
(3.42)

where \(\hat{ \varLambda } \in {\mathbb R}^{N\times n}\) is diagonal with entries \(\hat{ \sigma }_1,\hat{ \sigma }_2,\ldots , \hat{ \sigma }_n\), is called the truncated SVD of \(\varPhi \). So, the truncated SVD (3.42) can be used to handle the case where \(\varPhi \) has full rank but large condition number: for a given tol, it suffices to replace \(\varPhi \) with \(\hat{ \varPhi }\) and then to compute the LS estimate of \(\theta \) by means of \(\hat{ \varPhi }^+ Y\).
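
A possible implementation of the truncated SVD estimate is sketched below (NumPy assumed, tolerance value illustrative): singular values not exceeding tol are discarded and the pseudoinverse of the truncated matrix is applied to Y.

```python
import numpy as np

def truncated_svd_ls(Phi, Y, tol):
    """LS estimate via the truncated SVD (3.42): singular values <= tol are zeroed and
    the pseudoinverse of the truncated regression matrix is applied to Y."""
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    s_inv = np.where(s > tol, 1.0 / s, 0.0)          # reciprocals of the retained sigma_i
    return Vt.T @ (s_inv * (U.T @ Y))

# Revisiting Example 3.3 with tol = 1e-7 (anticipating Example 3.5 below)
Phi = 0.5 * np.array([[1.0, 1.0], [1.0 + 1e-8, 1.0 - 1e-8]])
Y = np.array([1.0, 1.0])
dY = np.array([0.01, 0.0])
print(truncated_svd_ls(Phi, Y + dY, 1e-7))           # approximately [1.005, 1.005]
```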

Example 3.5

(Truncated SVD) We revisit Example 3.3 by making use of the truncated SVD of \(\varPhi \). We take the user-supplied measure of uncertainty tol to be \(10^{-7}\). Then the LS estimate \(\hat{\theta }^{\text {LS}}_3\) computed by (3.41) with Y replaced by \(Y+\delta Y\) becomes

$$\begin{aligned} \hat{\theta }^{\text {LS}}_3 =\hat{ \varPhi }^+(Y+\delta Y) = \left[ \begin{array}{c} 1.0050\\ 1.0049 \end{array} \right] . \end{aligned}$$
(3.43)

One can thus see that the estimate is now very close to \([1 \ 1 ]^T\), which was the one obtained in the absence of the perturbation \(\delta Y\).    \(\square \)

3.3.2 Ill-Conditioning in System Identification

In Sect. 1.2 we have illustrated an ill-conditioned system identification problem. Below, we will see that the difficulty was due to the fact that low-pass filtered inputs may induce regression matrices with large \(\text {cond}(\varPhi )\).

Consider the FIR model of order n:

$$\begin{aligned} y(t) = \sum _{k=1}^n g_k u(t-k)+e(t), \quad t=1,\ldots ,N, \end{aligned}$$
(3.44)

which can be written in the form (3.13) as follows:

$$\begin{aligned} \begin{aligned}&Y=\varPhi \theta _0 +E\\&Y=\left[ \begin{array}{c} y(1) \\ y(2) \\ \vdots \\ y(N) \\ \end{array} \right] ,\ \varPhi =\left[ \begin{array}{cccc} u(0)&{} u(-1)&{} \cdots &{} u(1-n) \\ u(1)&{} u(0)&{} \cdots &{} u(2-n) \\ \vdots &{} \vdots &{} \cdots &{} \vdots \\ u(N-1)&{} u(N-2)&{} \cdots &{} u(N-n) \end{array} \right] ,\\&\theta _0=\left[ \begin{array}{c} g_1\\ g_2\\ \vdots \\ g_{n} \end{array} \right] , \ E=\left[ \begin{array}{c} e_1 \\ e_2 \\ \vdots \\ e_N \\ \end{array} \right] . \end{aligned} \end{aligned}$$
(3.45)

Then we have

$$\begin{aligned} \small \varPhi ^T\varPhi = \left[ \begin{array}{cccc} \sum _{t=0}^{N-1} u(t)^2&{} \sum _{t=0}^{N-1} u(t)u(t-1)&{} \ldots &{} \sum _{t=0}^{N-1} u(t)u(t-n+1) \\ \sum _{t=0}^{N-1} u(t)u(t-1)&{} \sum _{t=-1}^{N-2} u(t)^2&{} \ldots &{} \sum _{t=-1}^{N-2} u(t)u(t-n+2) \\ \vdots &{} \vdots &{} \ldots &{} \vdots \\ \sum _{t=0}^{N-1} u(t)u(t-n+1) &{} \sum _{t=-n+1}^{N-n} u(t) u(t+n-2)&{} \ldots &{} \sum _{t=-n+1}^{N-n} u(t)^2 \end{array} \right] . \end{aligned}$$
(3.46)

Since \(\text {cond}(\varPhi ^T\varPhi )=(\text {cond}(\varPhi ))^2\), we study \(\text {cond}(\varPhi ^T\varPhi )\) in what follows. In addition, while so far we have assumed deterministic regressors, now we work in a more structured probabilistic framework where the system input is a stochastic process. This implies that \(\varPhi \) is a random matrix. In particular, u(t) is filtered white noise, with the filter assumed to be stable and given by

$$\begin{aligned} H(q)=\sum _{k=0}^\infty h(k)q^{-k}. \end{aligned}$$
(3.47a)

Hence,

$$\begin{aligned} u(t) = \sum _{k=0}^\infty h(k)v(t-k) = H(q)v(t), \end{aligned}$$
(3.47b)

where v(t) is zero-mean white noise of variance \(\sigma ^2\) with bounded fourth moments. It follows that u(t) is a zero-mean stationary stochastic process with covariance function \(k_u(t,s)={\mathscr {E}}[u(t)u(s)]=R_u(t-s)\) with \(R_u(\tau )\) defined as follows:

$$\begin{aligned} {\mathscr {E}}[u(t)u(t-\tau )]&= \sum _{k=0}^\infty \sum _{l=0}^\infty h(k)h(l){\mathscr {E}}[v(t-k)v(t-\tau -l)]\\&=\sigma ^2\sum _{k=0}^\infty h(k)h(k-\tau )\triangleq R_u(\tau ). \end{aligned}$$

From ergodic theory, e.g., [25, Theorem 3.4], it also follows that

$$\begin{aligned} \frac{1}{N} \sum _{t=1}^N u(t)u(t-\tau ) \rightarrow R_u(\tau ), \quad N\rightarrow \infty ,\ \text { a.s.} \end{aligned}$$
(3.48)

From (3.46) and (3.48), one obtains the following almost sure convergence:

$$\begin{aligned}&\frac{1}{N}\varPhi ^T\varPhi \rightarrow \left[ \begin{array}{cccc} R_u(0)&{} R_u(1)&{} \cdots &{} R_u(n-1) \\ R_u(1)&{} R_u(0)&{} \cdots &{} R_u(n-2) \\ \vdots &{} \vdots &{} \cdots &{} \vdots \\ R_u(n-1) &{} R_u(n-2)&{} \cdots &{} R_u(0) \end{array} \right] , \quad N\rightarrow \infty ,\ \text { a.s.} \end{aligned}$$
(3.49)

So, \(\lim _{N\rightarrow \infty }\frac{1}{N}\varPhi ^T\varPhi \) is the covariance matrix of \(\left[ \begin{array}{ccc} u(1) &{} \ldots &{} u(n) \\ \end{array} \right] ^T \) whose condition number thus provides insights on the ill-conditioning affecting the system identification problem.

Since the covariance matrix is real and symmetric, its condition number is the ratio between the largest and the smallest of its eigenvalues. An important result of O. Toeplitz, e.g., [44], [20, Chap. 5], says that as  \(n\rightarrow \infty \), the eigenvalues of the covariance matrix of the infinite-dimensional vector  \(\left[ \begin{array}{ccc} u(1) &{} u(2) &{} \ldots \\ \end{array} \right] ^T\) coincide with the set of values assumed by the power spectrum of  u(t), which is given by

$$\begin{aligned} \varPsi _u(\omega ) = \sum _{\tau =-\infty }^{+\infty } R_u(\tau )e^{-i\omega \tau }. \end{aligned}$$
(3.50)

Hence, considering also that \(\varPsi _u(-\omega )=\varPsi _u(\omega )\), one has

$$\begin{aligned} \text {cond}\left( \lim _{n\rightarrow \infty }\lim _{N\rightarrow \infty }\frac{1}{N}\varPhi ^T\varPhi \right) = \frac{\max _{\omega \in [0, \pi ]} \varPsi _u(\omega )}{\min _{\omega \in [0, \pi ]} \varPsi _u(\omega )}. \end{aligned}$$
(3.51)

In addition, since u(t) is a filtered white noise (3.47) and H(q) is stable, one also has [see, e.g., [25, p. 37] for details]:

$$\begin{aligned} \varPsi _u(\omega ) = \sigma ^2 |H(e^{i\omega })|^2, \end{aligned}$$
(3.52)

where \(H(e^{i\omega })\) is the frequency function of the filter H(q), i.e.,

$$\begin{aligned} H(e^{i\omega }) = \sum _{k=0}^\infty h(k)e^{-i\omega k}. \end{aligned}$$
(3.53)

Finally, combining the results (3.49)–(3.53) yields

$$\begin{aligned} \text {cond}\left( \lim _{n\rightarrow \infty }\lim _{N\rightarrow \infty }\frac{1}{N}\varPhi ^T\varPhi \right) = \frac{\max _{\omega \in [0, \pi ]} |H(e^{i\omega })|^2}{\min _{\omega \in [0, \pi ]} |H(e^{i\omega })|^2}. \end{aligned}$$
(3.54)

When the maximum of \(|H(e^{i\omega })|\) is significantly larger than the minimum of \(|H(e^{i\omega })|\), the matrix \(\lim _{n\rightarrow \infty }\lim _{N\rightarrow \infty }\frac{1}{N}\varPhi ^T\varPhi \) could be very ill-conditioned. For instance, if we consider the stable filter

$$\begin{aligned} H(q) = \frac{1}{(1-aq^{-1})^2},\quad 0\le a< 1, \end{aligned}$$
(3.55)

then one has

$$\begin{aligned} \frac{\max _{\omega \in [0, \pi ]} |H(e^{i\omega })|^2}{\min _{\omega \in [0, \pi ]} |H(e^{i\omega })|^2}=\frac{(1+a)^4}{(1-a)^4}. \end{aligned}$$
(3.56)

As a varies from 0.01 to 0.99, the input power becomes more and more concentrated at low frequencies and the ill-conditioning affecting the system identification problem worsens. In fact, the above quantity increases from about 1 to \(1.6\times 10^9\).
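
The effect can be verified with a simple simulation; the sketch below (NumPy assumed, with illustrative values of N, n and a) filters white noise through (3.55), builds the FIR regression matrix (3.45) and compares the empirical condition number of \(\frac{1}{N}\varPhi ^T\varPhi \) with the asymptotic value (3.56), which it approaches as n and N grow.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 20000, 20
for a in (0.1, 0.5, 0.9):
    v = rng.standard_normal(N + n)
    # u(t) = v(t) + 2a u(t-1) - a^2 u(t-2), i.e., H(q) = 1/(1 - a q^{-1})^2 as in (3.55)
    u = np.zeros(N + n)
    u[0] = v[0]
    u[1] = v[1] + 2 * a * u[0]
    for t in range(2, N + n):
        u[t] = v[t] + 2 * a * u[t - 1] - a * a * u[t - 2]
    # FIR regression matrix (3.45): column k holds u(t-k) for t = 1,...,N
    Phi = np.column_stack([u[n - k:N + n - k] for k in range(1, n + 1)])
    print(a, np.linalg.cond(Phi.T @ Phi / N), (1 + a) ** 4 / (1 - a) ** 4)
```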

3.4 Regularized Least Squares with Quadratic Penalties

One way to handle ill-conditioning is to use regularized least squares (ReLS). This method will play a special role in this book as a way to control overfitting by encoding prior knowledge. First insights on these aspects are provided below.

ReLS adds a regularization term \(J(\theta )\) into the LS criterion (3.14), yielding the following problem:

$$\begin{aligned} \hat{ \theta }^{\text {R}}&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert Y-\varPhi \theta \Vert _2^2 + \gamma J(\theta ), \end{aligned}$$
(3.57)

where \(\gamma \ge 0\) is often called the regularization parameter. It balances the adherence to the data \(\Vert Y-\varPhi \theta \Vert _2^2\) against the penalty \(J(\theta )\). There are many choices for the regularization term, which can be connected with the prior knowledge on the true model parameter \(\theta _0\) that needs to be estimated.

In this section, we consider regularization terms \(J(\theta )\) which are quadratic functions of \(\theta \). The resulting estimator will be denoted by ReLS-Q in this chapter. In particular, we let \(J(\theta )=\theta ^TP^{-1}\theta \) so that the ReLS criterion (3.57) becomes

$$\begin{aligned} \hat{ \theta }^{\text {R}}&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert Y-\varPhi \theta \Vert _2^2 + \gamma \theta ^TP^{-1}\theta \end{aligned}$$
(3.58a)
$$\begin{aligned}&=(\varPhi ^T\varPhi +\gamma P^{-1})^{-1}\varPhi ^TY \end{aligned}$$
(3.58b)
$$\begin{aligned}&=P\varPhi ^T(\varPhi P\varPhi ^T+\gamma I_{N})^{-1}Y \end{aligned}$$
(3.58c)
$$\begin{aligned}&=(P\varPhi ^T\varPhi +\gamma I_{n})^{-1}P\varPhi ^TY, \end{aligned}$$
(3.58d)

where \(P\in {\mathbb R}^{n\times n}\) is a positive semidefinite matrix, here assumed invertible, often called the regularization matrix, and \(I_n\) is the n-dimensional identity matrix.

Remark 3.1

The regularization matrix P could be singular. In this case, (3.58a) is not well defined but, with a suitable arrangement, we can use the Moore–Penrose pseudoinverse \(P^+\) instead of \(P^{-1}\). In particular, let the SVD of P be

$$P= \begin{bmatrix}U_1&U_2\end{bmatrix}\begin{bmatrix}\varLambda _P &{} 0 \\ 0 &{} 0\end{bmatrix}\begin{bmatrix}U_1&U_2\end{bmatrix}^T, $$

where \(\varLambda _P\) is a diagonal matrix with the positive singular values of P as diagonal elements and \(U=\begin{bmatrix}U_1&U_2\end{bmatrix}\) is an orthogonal matrix with \(U_1\) having the same number of columns as that of \(\varLambda _P\). Recall also that \(P^+= U_1\varLambda _P^{-1}U_1^T\). In order to find how (3.58a) should be modified for singular P, let us consider

$$ P_{\varepsilon }=\begin{bmatrix}U_1&U_2\end{bmatrix}\begin{bmatrix}\varLambda _P &{} 0 \\ 0 &{} \varepsilon I\end{bmatrix}\begin{bmatrix}U_1&U_2\end{bmatrix}^T, \quad \varepsilon > 0. $$

By replacing P with \(P_{\varepsilon }\) in (3.58a), we obtain

$$\begin{aligned} \hat{ \theta }^{\text {R}}=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert Y-\varPhi \theta \Vert _2^2 + \gamma \theta ^TU_1\varLambda _P^{-1}U_1^T\theta + \frac{\gamma }{\varepsilon } \theta ^TU_2 U_2^T\theta . \end{aligned}$$
(3.59)

If we let \(\varepsilon \rightarrow 0\), it follows that the parameter vector must satisfy \(U_2^T\theta =0\). Therefore, we may conveniently associate to a singular P the modified regularization problem

$$\begin{aligned} \hat{ \theta }^{\text {R}}&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert Y-\varPhi \theta \Vert _2^2 + \gamma \theta ^TP^+\theta \end{aligned}$$
(3.60a)
$$\begin{aligned}&\ {{\,\mathrm{subj. to\ }\,}}\quad U_2^T\theta =0. \end{aligned}$$
(3.60b)

If \(P^{-1}\) is replaced by \(P^+\), it is easy to verify that (3.58c) or (3.58d) is still the optimal solution of (3.60). By contrast, this does not hold for (3.58b). For convenience, we will use (3.58a) in the sequel and refer to (3.60) for its rigorous meaning.

3.4.1 Making an Ill-Conditioned LS Problem Well Conditioned

ReLS-Q can make an ill-conditioned LS problem well conditioned. Consider ridge regression which, as discussed in Sect. 1.2, corresponds to setting \(P=I_n\), hence obtaining

$$\begin{aligned} \hat{ \theta }^{\text {R}}&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert Y-\varPhi \theta \Vert _2^2 + \gamma \Vert \theta \Vert _2^2 \end{aligned}$$
(3.61a)
$$\begin{aligned}&=(\varPhi ^T\varPhi +\gamma I_{n})^{-1}\varPhi ^TY. \end{aligned}$$
(3.61b)

The parameter \(\gamma \) directly affects the condition number of \((\varPhi ^T\varPhi +\gamma I_{n})\) whose inverse defines the regularized estimate. In fact, the positive definite square matrix \((\varPhi ^T\varPhi +\gamma I_{n})\) has eigenvalues (coincident with its singular values) equal to \(\sigma _i^2 + \gamma \). Therefore,

$$ \text {cond}(\varPhi ^T\varPhi +\gamma I_n)=\frac{\sigma _1^2+\gamma }{\sigma _n^2 + \gamma } $$

which can be adjusted by tuning the regularization parameter \(\gamma \). This means that regularization can make the LS problem well conditioned even when \(\varPhi \) is rank deficient: if the smallest singular value is null one has

$$ \text {cond}(\varPhi ^T\varPhi +\gamma I_n)=\frac{\sigma _1^2+\gamma }{\gamma }. $$
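
The improvement can be checked numerically; a short sketch (NumPy assumed, with \(\gamma \) and the regressors chosen only for illustration) computes the ridge regression estimate (3.61b) and compares the condition numbers of \(\varPhi ^T\varPhi \) with and without the regularization term.

```python
import numpy as np

def ridge_estimate(Phi, Y, gamma):
    """Ridge regression estimate (3.61b): (Phi^T Phi + gamma I_n)^{-1} Phi^T Y."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)

# Ill-conditioned polynomial regressors with n = 16 monomials (cf. Example 3.6 below)
x = np.linspace(0.0, 1.0, 40)
Phi = np.vander(x, 16, increasing=True)
Y = np.sin(x) ** 2 * (1 - x ** 2)                        # noise-free outputs of (3.26), for illustration
theta_r = ridge_estimate(Phi, Y, 0.1)

print(np.linalg.cond(Phi.T @ Phi))                       # huge
print(np.linalg.cond(Phi.T @ Phi + 0.1 * np.eye(16)))    # modest: roughly sigma_1^2/gamma when sigma_n^2 << gamma
```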

3.4.1.1 Mean Squared Error

Simple calculations of expectations with respect to the errors \(e_i\), with the regressors \(\phi _i\) assumed to be deterministic, lead to

$$\begin{aligned} {\mathscr {E}}(\hat{\theta }^{\text {R}})&= (\varPhi ^T\varPhi +\gamma P^{-1})^{-1}\varPhi ^T\varPhi \theta _0 \end{aligned}$$
(3.62a)
$$\begin{aligned} \hat{\theta }_{\text {bias}}^{\text {R}}&={\mathscr {E}}(\hat{\theta }^{\text {R}})-\theta _0= -(\varPhi ^T\varPhi +\gamma P^{-1})^{-1}\gamma P^{-1}\theta _0 \end{aligned}$$
(3.62b)
$$\begin{aligned} \text {Cov}(\hat{\theta }^{\text {R}},\hat{\theta }^{\text {R}})&={\mathscr {E}}[(\hat{\theta }^{\text {R}}-{\mathscr {E}}( \hat{\theta }^{\text {R}}))(\hat{\theta }^{\text {R}}-{\mathscr {E}}( \hat{\theta }^{\text {R}}))^T]\nonumber \\&=(\varPhi ^T\varPhi +\gamma P^{-1})^{-1}\sigma ^2\varPhi ^T\varPhi (\varPhi ^T\varPhi +\gamma P^{-1})^{-1} \end{aligned}$$
(3.62c)
$$\begin{aligned} \text {MSE}(\hat{\theta }^{\text {R}},\theta _0)&= {\mathscr {E}}(\hat{\theta }^{\text {R}}-\theta _0) (\hat{\theta }^{\text {R}}-\theta _0)^T\nonumber \\&=\text {Cov}(\hat{\theta }^{\text {R}},\hat{\theta }^{\text {R}})+\hat{\theta }_{\text {bias}}^{\text {R}}(\hat{\theta }_{\text {bias}}^{\text {R}})^T\nonumber \\&= (\varPhi ^T\varPhi +\gamma P^{-1})^{-1}(\sigma ^2\varPhi ^T\varPhi +\gamma ^2 P^{-1} \theta _0\theta _0^T P^{-1})(\varPhi ^T\varPhi +\gamma P^{-1})^{-1}, \end{aligned}$$
(3.62d)

where \(\text {Cov}(\hat{\theta }^{\text {R}},\hat{\theta }^{\text {R}})\) is the covariance matrix of \(\hat{\theta }^{\text {R}}\) and \(\text {MSE}(\hat{\theta }^{\text {R}},\theta _0) \) is the MSE matrix of \(\hat{\theta }^{\text {R}}\), a function of the true model parameter \(\theta _0\). Expression (3.62) clearly shows the influence of regularization on the statistical properties of \(\hat{\theta }^{\text {R}}\):

  • when \(\gamma =0\), i.e., there is no regularization, \(\hat{\theta }^{\text {R}}\) reduces to \(\hat{\theta }^{\text {LS}}\) and \(\text {MSE}(\hat{\theta }^{\text {R}},\theta _0)\) reduces to \(\sigma ^2(\varPhi ^T\varPhi )^{-1}\);

  • when \(\gamma >0\), the regularized estimator \(\hat{\theta }^{\text {R}}\) is biased and the MSE matrix of \(\hat{\theta }^{\text {R}}\) is decomposed into two components: the bias \(\hat{\theta }_{\text {bias}}^{\text {R}}(\hat{\theta }_{\text {bias}}^{\text {R}})^T\) and the variance \(\text {Cov}(\hat{\theta }^{\text {R}},\hat{\theta }^{\text {R}})\). By a suitable choice of the regularization matrix P and the regularization parameter \(\gamma \), the variance of \(\hat{\theta }^{\text {R}}\) can be made “smaller” and, if the resulting increase in the bias is moderate, an MSE matrix “smaller” than that associated to LS can be obtained.

3.4.2 Equivalent Degrees of Freedom

For a given regularization matrix P, we have seen (also deriving the structure of the MSE) that the regularization parameter \(\gamma \) controls the influence of the regularization: as \(\gamma \) varies from 0 to \(\infty \), the influence of the regularization term \(\theta ^TP^{-1}\theta \) becomes stronger. In particular, when \(\gamma =0\) there is no regularization and \(\hat{\theta }^{\text {R}}\) reduces to \(\hat{\theta }^{\text {LS}}\). As \(\gamma \rightarrow \infty \), the regularization term \(\gamma \theta ^TP^{-1}\theta \) overwhelms the data fit \(\Vert Y-\varPhi \theta \Vert _2^2\) and \(\hat{\theta }^{\text {R}}\) tends to 0.

Often, it is more convenient to exploit a normalized measure of the influence of the regularization instead of considering directly the value of \(\gamma \). For this goal, we introduce the so-called influence or hat  matrix:

$$\begin{aligned} H = \varPhi P \varPhi ^T (\varPhi P \varPhi ^T+\gamma I_{N})^{-1}. \end{aligned}$$
(3.63)

This matrix is important since it connects the measured output Y with the predicted output \(\hat{Y} = \varPhi \hat{\theta }^{\text {R}}\), i.e., one has

$$\begin{aligned} \hat{Y} = \varPhi \hat{\theta }^{\text {R}} = HY. \end{aligned}$$
(3.64)

It is also important since its trace is indeed a normalized measure of the influence of the regularization. To see this, let \(A=\varPhi P \varPhi ^T\) and consider its SVD

$$ A=UDU^T, $$

where \(UU^T=I\) and D is a diagonal matrix with nonnegative entries \(d_i^2\). Then,

$$ H=UDU^T(UDU^T +\gamma I_{N}UU^T) ^{-1}= U D (D+\gamma I_{N}) ^{-1} U^T. $$

Since U is orthogonal, one has \({{\,\mathrm{trace}\,}}(UMU^T)={{\,\mathrm{trace}\,}}(M)\), so that

$$ {{\,\mathrm{trace}\,}}(H) = {{\,\mathrm{trace}\,}}(D (D+\gamma I_{N}) ^{-1}) = \sum _{i=1}^n \frac{d_i^2}{d_i^2+\gamma }. $$

The above equation implies that \({{\,\mathrm{trace}\,}}(H)\) is a monotonically decreasing function of \(\gamma \). It attains its maximum at \(\gamma =0\) and infimum as \(\gamma \rightarrow \infty \). In particular, for \(\gamma =0\) one has \(\hat{\theta }^{\text {R}}=\hat{\theta }^{\text {LS}}\) and the hat matrix H becomes \(H=\varPhi (\varPhi ^T\varPhi )^{-1}\varPhi ^T\), implying that \({{\,\mathrm{trace}\,}}(H)=n\) if \(\varPhi \) is full rank. For \(\gamma \rightarrow \infty \) one instead has \({{\,\mathrm{trace}\,}}(H)\rightarrow 0\). Therefore, it holds that \(0<{{\,\mathrm{trace}\,}}(H)\le n\). Hence, since n is the dimension of \(\theta \), i.e., the number of parameters in the linear regression model, \({{\,\mathrm{trace}\,}}(H)\) can be seen as the counterpart of the number of parameters to be estimated in the LS context. In other words, in the regularized framework \({{\,\mathrm{trace}\,}}(H)\) plays the role of the model order. It thus becomes natural to call it the equivalent degrees of freedom for the ReLS-Q estimate \(\hat{\theta }^{\text {R}}\), e.g., [21, Sect. 7.6], [4, p. 559]:

$$\begin{aligned} {{\,\mathrm{dof}\,}}(\hat{\theta }^{\text {R}}) = {{\,\mathrm{trace}\,}}(H). \end{aligned}$$
(3.65)

The notation \({{\,\mathrm{dof}\,}}(\gamma )\) will also be used in the book in place of \({{\,\mathrm{dof}\,}}(\hat{\theta }^{\text {R}})\) to stress the dependence of the equivalent degrees of freedom on the regularization parameter.
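
A direct way to compute (3.65) is via the eigenvalues of \(\varPhi P\varPhi ^T\), as in the derivation above; a sketch follows (NumPy assumed, matrices chosen only for illustration).

```python
import numpy as np

def dof(Phi, P, gamma):
    """Equivalent degrees of freedom (3.65): trace of the hat matrix (3.63),
    computed as sum_i d_i^2 / (d_i^2 + gamma) with d_i^2 the eigenvalues of Phi P Phi^T."""
    d2 = np.linalg.eigvalsh(Phi @ P @ Phi.T)       # nonnegative up to roundoff
    d2 = np.clip(d2, 0.0, None)                    # guard against tiny negative values
    return np.sum(d2 / (d2 + gamma))

Phi = np.vander(np.linspace(0, 1, 40), 16, increasing=True)
P = np.eye(16)                                     # ridge regression case
for gamma in (0.01, 0.1, 1.0):
    print(gamma, dof(Phi, P, gamma))               # monotonically decreasing, bounded by n = 16
```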

Fig. 3.7 Polynomial regression: true function g(x) (blue line) and ridge regression estimates obtained with 16 different values of the regularization parameter

Fig. 3.8 Polynomial regression: profile of the ridge regression fit (3.28) as a function of \(\gamma \). Large fit values are associated to estimates close to the true function

Fig. 3.9 Polynomial regression: profile of the base 10 logarithm of the condition number of \(\varPhi ^T\varPhi +\gamma I_n\) as a function of \(\gamma \)

Fig. 3.10 Polynomial regression: profile of the equivalent degrees of freedom (3.65) as a function of \(\gamma \) using ridge regression

Example 3.6

(Polynomial regression: ridge regression) As shown in Fig. 3.6, the regression matrix \(\varPhi \) built in the polynomial regression example (3.26) and (3.27) is ill-conditioned for large n. Here, we consider the case \(n=16\) (corresponding to a polynomial of order 15) which leads to \({{\,\mathrm{cond}\,}}(\varPhi )=1.49\times 10^{11}\). To illustrate how ridge regression (3.61) can counteract the ill-conditioning, let \(\gamma =\gamma _i\), \(i=1,\ldots ,16\), with \(\gamma _1=0.01\) and \(\gamma _{16}=0.31\) and \(\gamma _2,\ldots ,\gamma _{15}\) evenly spaced between \(\gamma _1\) and \(\gamma _{16}\). For each \(\gamma _i\), we then compute the corresponding ridge regression estimate (3.61) and plot the 16 estimates \(\hat{g}(x) = \phi (x)^T\hat{\theta }^{\text {R}}\) in Fig. 3.7. The fits (3.28) are shown in Fig. 3.8 as a function of \(\gamma \). One can see that \(\gamma =0.11\) gives the best performance, with a fit of around \(89\%\). Interestingly, such fit is larger than the best result obtained by LS through optimal tuning of the discrete model order, see Fig. 3.4. The base 10 logarithm of the condition number of \(\varPhi ^T\varPhi +\gamma I_n\), as a function of \(\gamma \), is displayed in Fig. 3.9. One can see that the matrix is much better conditioned now. Figure 3.10 plots the equivalent degrees of freedom of \(\hat{\theta }^{\text {R}}\). Even if \(n=16\), the actual model complexity in terms of equivalent degrees of freedom is much smaller, around 4 for the tested values of \(\gamma \). Finally, the estimates of each component of \(\theta \) obtained using the different values of \(\gamma \) are shown in Fig. 3.11.

Fig. 3.11 Polynomial regression: profile of the estimates of each component forming the ridge regression estimate (3.61). For each value \(k=0,\ldots ,15\) on the x-axis the plot reports the estimates of the coefficient of the monomial \(x^k\) obtained by using different values of the regularization parameter \(\gamma \)

   \(\square \)

3.4.2.1 Regularization Design: The Optimal Regularizer

A natural question is how to design a regularization matrix P and select \(\gamma \) to obtain a “good” model estimate. From a “classic” or “frequentist” point of view, rational choices are those that make the MSE matrix (3.62d) small in some sense, as discussed below. For our purposes, it is useful to rewrite the MSE matrix (3.62d) as follows:

$$\begin{aligned}&\text {MSE}(\hat{\theta }^{\text {R}},\theta _0) = \sigma ^{2}\left( \frac{ P\varPhi ^T\varPhi }{\gamma }+I_n\right) ^{-1}\left( \frac{P\varPhi ^T\varPhi P}{\gamma ^{2} }+\frac{\theta _0\theta _0^T}{\sigma ^2}\right) \left( \frac{\varPhi ^T\varPhi P}{\gamma } +I_n \right) ^{-1}. \end{aligned}$$
(3.66)

Then, it is useful to first introduce the following lemma.

Lemma 3.1

(based on [9]) Consider the matrix

$$\begin{aligned} M(Q) =&(QR+I)^{-1}(QRQ+Z)(RQ+I)^{-1}, \end{aligned}$$

where Q, R and Z are positive semidefinite matrices. Then for all Q

$$\begin{aligned} M(Z) \preceq M(Q), \end{aligned}$$
(3.67)

which means that \(M(Q) - M(Z)\) is positive semidefinite.

The proof consists of straightforward calculations and can be found in Sect. 3.8.2.

Using (3.66) and Lemma 3.1, the question of which P and \(\gamma \) give the best MSE of \(\hat{\theta }^{\text {R}}\) has a clear answer: the equation \(\sigma ^2 P=\gamma \theta _0\theta _0^T\) needs to be satisfied. Thus, the following result holds.

Proposition 3.1

(Optimal regularization for a given \(\theta _0\),  based on [9]) Letting \(\gamma =\sigma ^2\), the regularization matrix

$$\begin{aligned} P=\theta _0\theta _0^T \end{aligned}$$
(3.68)

minimizes the MSE matrix (3.66) in the sense of (3.67).

Note that the MSE matrix (3.66) is linear in \(\theta _0\theta _0^T\). This means that if we compute \(\hat{\theta }^{\text {R}}\) with the same P for a collection of true systems \(\theta _0\), the average MSE over that collection will be given by (3.66) with \(\theta _0\theta _0^T\) replaced by its average over the collection. In particular, if \(\theta _0\) is a random vector with \({\mathscr {E}}(\theta _0\theta _0^T)=\varPi \), we obtain the following result.

Proposition 3.2

(Optimal regularization for a random system  \(\theta _0\), based on [9]) Consider (3.62d) with \(\gamma =\sigma ^2\). Then, the best average (expected) MSE for a random true system \(\theta _0\) with \({\mathscr {E}}(\theta _0\theta _0^T)=\varPi \) is obtained by the regularization matrix \(P=\varPi \).

Propositions 3.1 and 3.2 thus give a somewhat preliminary answer to our design problem. Since the best regularization matrix \(P=\theta _0\theta _0^T\) depends on the true system \(\theta _0\), such a formula cannot be used in practice. Nevertheless, it suggests choosing a regularization matrix which mimics the behaviour of \(\theta _0\theta _0^T\). Using prior knowledge on the true system \(\theta _0\), this can be done by postulating a parametrized family of matrices \(P(\eta )\) with \(\eta \in \varGamma \subset {\mathbb R}^m\), where \(\eta \) is the so-called hyperparameter vector, \(\varGamma \) is the set where \(\eta \) can vary and m is the dimension of \(\eta \). Thus, the choice of a parametrized regularization matrix is similar to model structure selection in system identification. The nature of the optimal regularizer also suggests setting

$$\begin{aligned} \gamma = \sigma ^2. \end{aligned}$$
(3.69)

However, the noise variance \(\sigma ^2\) is in general unknown and needs to be estimated from the data. One can adopt equations (3.22) or (3.23). Another option is to include \(\sigma ^2\) in \(\eta \) and then estimate it together with the other hyperparameters.

3.5 Regularization Tuning for Quadratic Penalties

3.5.1 Mean Squared Error and Expected Validation Error

Now, assume that a parametrized family of regularization matrices \(P(\eta )\) has been defined. The vector \(\eta \) is in general unknown and has to be tuned by using the available measurements. The ReLS-Q estimate \(\hat{\theta }^{\text {R}}(\eta )\) in (3.58) depends on \(\eta \) and the estimation strategy depends on the measure used to quantify its quality. We will consider the following two criteria:

  • minimizing the MSE;

  • minimizing the expected validation error (EVE).

3.5.1.1 Minimizing the MSE

Still adopting a “classic” or “frequentist” point of view, a rational choice of \(\eta \) is one that makes the MSE matrix (3.62d) small in some sense. For ease of estimation, a scalar measure is often exploited. In [25, Chap. 12], it is suggested to use a weighting matrix Q and \({{\,\mathrm{trace}\,}}(\text {MSE}({\hat{\theta }^{\text {R}}(\eta )},\theta _0)Q)\) as a quality measure of \(\hat{\theta }^{\text {R}}(\eta )\), where Q reflects the intended use of the model \(\hat{\theta }^{\text {R}}(\eta )\). Then an estimate of \(\eta \), say \(\hat{\eta }\), is obtained as follows:

$$\begin{aligned} \hat{\eta } = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma } {{\,\mathrm{trace}\,}}(\text {MSE}({\hat{\theta }^{\text {R}}(\eta )},\theta _0)Q). \end{aligned}$$
(3.70)

Note that (3.70) depends on the true system \(\theta _0\), which is unknown, and thus cannot be used. In practice, we need to first find a “good” estimate, say \(\hat{\theta }\), of the true system \(\theta _0\) and then to replace \(\theta _0\) in (3.70) with \(\hat{\theta }\). Then, hopefully, a “good” estimate of \(\eta \) is given by

$$\begin{aligned} \hat{\eta }&= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma } {{\,\mathrm{trace}\,}}(\text {MSE}({\hat{\theta }^{\text {R}}(\eta )},\hat{\theta })Q). \end{aligned}$$
(3.71)

Different choices of \(\hat{\theta }\) and Q lead to different estimators (3.71). Examples are obtained by setting \(\hat{\theta }\) to the LS estimate or to the ridge regression estimate of \(\theta _0\), while the choice \(Q=I_n\) is often used. In any case, the major difficulty underlying the idea of “minimizing the MSE” for hyperparameter tuning lies in whether or not \(\hat{\theta }\) is a “good” estimate of \(\theta _0\), which is actually our fundamental problem.

3.5.1.2 Minimizing the EVE

An alternative quality measure of \(\hat{\theta }^{\text {R}}(\eta )\) is related to model prediction capability on independent validation data and is characterized by the expected validation error (EVE).

To define it, we need to introduce the training/estimation data and the validation data. The training data are used for estimating the model and are contained in the set \(\mathscr {D}_\text {T}\). The validation data are used to assess model prediction capability and are contained in the set \(\mathscr {D}_\text {V}\).

Now, let \(\hat{\theta }^{\text {R}}(\eta )\) denote a general ReLS-Q estimate parametrized by the vector \(\eta \) and obtained using only the training data \(\mathscr {D}_\text {T}\). Let \(y_\text {v}\in {\mathbb R}\), \(\phi _{\text {v}}\in {\mathbb R}^n\) be a validation sample pair. These objects could both be random, e.g., \(y_\text {v}\) can be affected by noise and the regressor could be defined by a stochastic system input. The validation error \(\text {EVE}_{\mathscr {D}_\text {T}}(\eta )\) is then given by

$$\begin{aligned} \text {EVE}_{\mathscr {D}_\text {T}}(\eta )={\mathscr {E}}[(y_{\text {v}}-\phi _{\text {v}}^T\hat{\theta }^{\text {R}}(\eta ))^2|\mathscr {D}_\text {T}]. \end{aligned}$$
(3.72)

In the above equation, the expectation \({\mathscr {E}}\) is computed w.r.t. the joint distribution of \(y_\text {v}\) and \(\phi _{\text {v}}\) conditioned on the training data \(\mathscr {D}_\text {T}\). If \(\phi _{\text {v}}\in {\mathbb R}^n\) is deterministic and, as usual, \(y_\text {v}\) is affected by a noise independent of those entering the training set, the mean is taken just w.r.t. such noise, with \(\mathscr {D}_\text {T}\) influencing only \(\hat{\theta }^{\text {R}}\). In any case, the result is a function of the training set. Now, seeing \(\mathscr {D}_\text {T}\) itself as random, the EVE is

$$\begin{aligned} \text {EVE}(\eta )\triangleq {\mathscr {E}}[\text {EVE}_{\mathscr {D}_\text {T}}(\eta )], \end{aligned}$$
(3.73)

where the expectation \({\mathscr {E}}\) is over the training set. Note that the final result is a function of the true \(\theta _0\) which determines the probability distributions of the training and validation data.

The \(\text {EVE}(\eta )\) measures the prediction capability of the model \(\hat{\theta }^{\text {R}}(\eta )\) before seeing any training or validation data: the smaller the \(\text {EVE}(\eta )\), the better the expected model prediction capability. Therefore, it is natural to estimate \(\eta \) as follows:

$$\begin{aligned} \hat{\eta }=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma }\ \text {EVE}(\eta ). \end{aligned}$$
(3.74)

However, as said, the above objective depends on the unknown vector \(\theta _0\), so that (3.74) cannot be used in practice to estimate \(\eta \). The problem is analogous to that encountered when trying to tune \(\eta \) by minimizing the MSE.

Remark 3.2

Interestingly, the idea of “minimizing the MSE” and the idea of “minimizing the EVE” are connected. To see this, we assume for simplicity that the regressors \(\phi _i\), \(i=1,\ldots ,N\) in the training data and \(\phi _{\text {v}}\) in the validation data are deterministic. Then it can be shown that

$$\begin{aligned} \text {EVE}(\eta )&={\mathscr {E}}[(y_{\text {v}}-\phi _{\text {v}}^T\hat{\theta }^{\text {R}}(\eta ))^2] =\sigma ^2+\phi _{\text {v}}^T \text {MSE}(\hat{\theta }^{\text {R}}(\eta ),\theta _0)\phi _{\text {v}}, \end{aligned}$$
(3.75)

where the expectation \({\mathscr {E}}\) is over everything that is random, and \(\text {MSE}(\hat{\theta }^{\text {R}}(\eta ),\theta _0)\) is the MSE matrix of \(\hat{\theta }^{\text {R}}(\eta )\) defined in (3.62d). Clearly, (3.75) shows that minimizing \(\text {EVE}(\eta )\) with respect to \(\eta \) is equivalent to minimizing \({{\,\mathrm{trace}\,}}(\text {MSE}(\hat{\theta }^{\text {R}}(\eta ),\theta _0)Q)\) with respect to \(\eta \) when \(Q=\phi _{\text {v}}\phi _{\text {v}}^T\).

To overcome the fact that the EVE depends on the unknown \(\theta _0\), we could first find a “good” estimate of \(\text {EVE}(\eta )\) using the available data and then determine the hyperparameter vector by minimizing it. There are two ways to achieve this goal: by efficient sample reuse of the data and by considering the in-sample EVE instead. More details will be provided in the next two subsections.

3.5.2 Efficient Sample Reuse

One way to estimate \(\text {EVE}(\eta )\) by efficient sample reuse is cross-validation (CV) [41], together with its variants already mentioned in Sects. 2.6.3 and 3.2.2 when discussing model order selection.

3.5.2.1 Hold Out Cross-Validation

The simplest CV is the so-called hold out CV (HOCV), which is widely used to select the model order for the classical PEM/ML. The HOCV can also be used to estimate the hyperparameter \(\eta \in \varGamma \) for the ReLS-Q method.

The idea of hold out CV is to first split the given data into two parts: the training data \(\mathscr {D}_\text {T}\) and the validation data \(\mathscr {D}_\text {V}\). The prediction capability is measured in terms of the validation error. The model that gives the smallest validation error will be selected. More specifically, the HOCV takes the following three steps:

  1. Split the given data into two parts: \(\mathscr {D}_\text {T}\) and \(\mathscr {D}_\text {V}\).

  2. Estimate the model \(\hat{\theta }^{\text {R}}(\eta )\) based on \(\mathscr {D}_\text {T}\) for different values of \(\eta \in \varGamma \).

  3. Calculate the validation error for \(\hat{\theta }^{\text {R}}(\eta )\) over the validation data \(\mathscr {D}_\text {V}\):

    $$\begin{aligned} \text {CV}(\eta )&= \sum _{(y_\text {v},\phi _\text {v})\in \mathscr {D}_\text {V} } (y_\text {v}-\phi _\text {v}^T\hat{\theta }^{\text {R}}(\eta ))^2, \end{aligned}$$

    where the summation is over all pairs of \((y_\text {v},\phi _\text {v})\) in the validation data \(\mathscr {D}_\text {V}\). Then, select the value of \(\eta \) that minimizes \(\text {CV}(\eta )\):

    $$\begin{aligned} \hat{\eta } = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma }\ \text {CV}(\eta ). \end{aligned}$$
    (3.76)

It is also possible to exchange the roles of the training and validation sets in order to perform a second validation step: the model is estimated on the previous validation set and the validation error is computed on the previous training set. The overall validation error is then obtained by averaging the two validation errors.
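
For illustration, a minimal Python sketch of the HOCV procedure above is reported below. The scalar parametrization \(P(\eta )=\eta I_n\) (a ridge-type family), the 50/50 split, the grid of candidate \(\eta \) values and the synthetic data are choices made only for this example and are not prescribed by the method.

```python
import numpy as np

def rels_q(Phi, Y, P, sigma2):
    """ReLS-Q estimate (3.78): (Phi^T Phi + sigma^2 P^{-1})^{-1} Phi^T Y."""
    return np.linalg.solve(Phi.T @ Phi + sigma2 * np.linalg.inv(P), Phi.T @ Y)

def hold_out_cv(Phi, Y, etas, sigma2, split=0.5):
    """Hold-out CV (3.76) with the illustrative family P(eta) = eta * I_n."""
    N, n = Phi.shape
    N_T = int(split * N)
    Phi_T, Y_T = Phi[:N_T], Y[:N_T]            # training data D_T
    Phi_V, Y_V = Phi[N_T:], Y[N_T:]            # validation data D_V
    cv = []
    for eta in etas:
        theta = rels_q(Phi_T, Y_T, eta * np.eye(n), sigma2)
        cv.append(np.sum((Y_V - Phi_V @ theta) ** 2))   # CV(eta)
    return etas[int(np.argmin(cv))], np.array(cv)

# toy data, used only to make the snippet self-contained
rng = np.random.default_rng(0)
N, n, sigma2 = 60, 8, 0.1
Phi = rng.standard_normal((N, n))
theta0 = rng.standard_normal(n)
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.standard_normal(N)
eta_hat, cv = hold_out_cv(Phi, Y, np.logspace(-3, 2, 30), sigma2)
```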

3.5.2.2 k-Fold Cross-Validation

The HOCV with swapped sets is a special case of the more general k-fold CV with \(k=2\), e.g., [24]. If the data set size is small, the HOCV may perform poorly. In fact, the training data may not be sufficiently rich to build good models and a validation set of small size may give an overly uncertain validation error. In this case, the k-fold CV with \(k>2\) could be used.

The idea of k-fold CV is to first split the data into k parts of equal size. For every \(\eta \in \varGamma \), the following procedure is repeated k times. At the ith run with \(i=1,2,\ldots ,k\):

  1. Retain the ith part as the validation data \(\mathscr {D}_{\text {V},i}\), and use the remaining \(k-1\) parts as the training data \(\mathscr {D}_{\text {T},-i}\).

  2. Estimate \(\hat{\theta }^{\text {R}}(\eta )\) based on the training data \(\mathscr {D}_{\text {T},-i}\) and then calculate the validation error over the validation data \(\mathscr {D}_{\text {V},i}\):

    $$\begin{aligned} \text {CV}_{-i}(\eta )&= \sum _{(y_\text {v},\phi _\text {v})\in \mathscr {D}_{\text {V},i} } (y_\text {v}-\phi _\text {v}^T\hat{\theta }^{\text {R}}(\eta ))^2, \end{aligned}$$

    where the summation is over all pairs of \((y_\text {v},\phi _\text {v})\) in the validation data \(\mathscr {D}_{\text {V},i}\).

Finally, the k validation errors \(\text {CV}_{-i}(\eta )\) so obtained are summed to obtain the following total validation error for \(\eta \):

$$\begin{aligned} \text {CV}(\eta )=\sum _{i=1}^k\text {CV}_{-i}(\eta ), \end{aligned}$$

and the estimate of \(\eta \) is finally given by

$$\begin{aligned} \hat{\eta } = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma }\ \text {CV}(\eta ). \end{aligned}$$
(3.77)
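
A compact sketch of the k-fold procedure follows; as before, \(P(\eta )=\eta I_n\) is just an illustrative parametrization (so that \(\sigma ^2P^{-1}(\eta )=(\sigma ^2/\eta )I_n\)) and the contiguous fold assignment is the simplest possible choice.

```python
import numpy as np

def k_fold_cv(Phi, Y, etas, sigma2, k=5):
    """k-fold CV score (3.77) for ReLS-Q with the illustrative family P(eta) = eta * I_n."""
    N, n = Phi.shape
    folds = np.array_split(np.arange(N), k)           # contiguous folds, for simplicity
    cv = np.zeros(len(etas))
    for val_idx in folds:
        tr_idx = np.setdiff1d(np.arange(N), val_idx)  # training indices D_{T,-i}
        Phi_T, Y_T = Phi[tr_idx], Y[tr_idx]
        Phi_V, Y_V = Phi[val_idx], Y[val_idx]         # validation data D_{V,i}
        for j, eta in enumerate(etas):
            theta = np.linalg.solve(
                Phi_T.T @ Phi_T + (sigma2 / eta) * np.eye(n), Phi_T.T @ Y_T)
            cv[j] += np.sum((Y_V - Phi_V @ theta) ** 2)   # accumulate CV_{-i}(eta)
    return etas[int(np.argmin(cv))], cv
```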

3.5.2.3 Predicted Residual Error Sum of Squares and Variants

The computation of the k-fold CV is often expensive. An important exception is the leave-one-out CV (LOOCV), where each validation set includes only one validation pair: when the square loss function is used, the total validation error admits a closed-form expression and the LOOCV is also known as the predicted residual error sum of squares (PRESS), e.g., [2].

First, recall the linear regression model (3.13) and the corresponding data \(y_i\in {\mathbb R}\) and \(\phi _i\in {\mathbb R}^n\) for \(i=1,\ldots ,N\). Then the ReLS-Q estimate is

$$\begin{aligned} \hat{\theta }^{\text {R}}&= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta ||Y-\varPhi \theta ||^2 + \sigma ^2\theta ^TP^{-1}(\eta )\theta \nonumber \\&= \left( \varPhi ^T\varPhi + \sigma ^2 P^{-1}(\eta )\right) ^{-1}\varPhi ^T Y \nonumber \\&= \left( \sum _{i=1}^N \phi _i\phi _i^T + \sigma ^2 P^{-1}(\eta )\right) ^{-1}\sum _{i=1}^N \phi _iy_i, \end{aligned}$$
(3.78)

where we have set \(\gamma =\sigma ^2\) following (3.69). For the kth measured output \(y_k\), the corresponding predicted output \(\hat{y}_k\) and residual \(r_k\) are, respectively,

$$\begin{aligned} \hat{y}_k&=\phi _k^T \left( \sum _{i=1}^N \phi _i\phi _i^T + \sigma ^2 P^{-1}(\eta )\right) ^{-1}\sum _{i=1}^N \phi _iy_i, \end{aligned}$$
(3.79a)
$$\begin{aligned} r_k&= y_k - \hat{y}_k. \end{aligned}$$
(3.79b)

Then, PRESS selects the value of \(\eta \in \varGamma \) that minimizes the sum of squares of the validation errors. One can prove that this corresponds to the following problem:

$$\begin{aligned} \text {PRESS:}\quad \hat{\eta }=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma }\ \sum _{k=1}^N \frac{r_k^2}{(1-\phi _k^TM^{-1}\phi _k)^2}, \end{aligned}$$
(3.80)

where \(r_k\) are defined by (3.79) while

$$\begin{aligned} M= \sum _{i=1}^N \phi _i\phi _i^T + \sigma ^2 P^{-1}(\eta ). \end{aligned}$$
(3.81)

The derivation of (3.80) can be found in Sect. 3.8.3. It is worth noting that the denominator in (3.80) is strictly related to the diagonal entries of the hat matrix H defined in (3.63). In fact,

$$ \phi _k^TM^{-1}\phi _k = H_{kk} $$

so that

$$ \text {PRESS:}\quad \hat{\eta } =\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma }\ \sum _{k=1}^N \frac{r_k^2}{(1-H_{kk})^2}. $$

Hence, interestingly, one can conclude that PRESS evaluation requires computing just the ReLS-Q estimate on the full data set (instead of solving N problems, one for each missing measurement in the training set).

One method that is closely related with PRESS is the so-called generalized cross-validation (GCV), e.g., [18]. GCV is obtained by replacing in (3.80) the factors \(H_{kk}\) by their average, i.e., \({{\,\mathrm{trace}\,}}(H)/N\):

$$\begin{aligned} \text {GCV:}\quad \hat{\eta } = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma } \frac{1}{(1-{{\,\mathrm{trace}\,}}(H)/N)^2}\sum _{k=1}^Nr_k^2. \end{aligned}$$
(3.82)

Recalling (3.65), the term \({{\,\mathrm{trace}\,}}(H)\) defines the degrees of freedom of \(\hat{\theta }^{\text {R}}\). Hence, the GCV criterion can be rewritten as follows:

$$\begin{aligned}\text {GCV:}\quad \hat{\eta } = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma } \frac{1}{(1-\text {dof}(\hat{\theta }^{\text {R}})/N)^2}\sum _{k=1}^Nr_k^2. \end{aligned}$$
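
Both scores can be computed from a single full-data fit, since only the residuals and the diagonal (or the trace) of the hat matrix are needed. A minimal sketch is given below, again under the illustrative parametrization \(P(\eta )=\eta I_n\) and with \(\sigma ^2\) assumed known. Minimizing either score over a grid of candidate \(\eta \) values then yields \(\hat{\eta }\).

```python
import numpy as np

def press_and_gcv(Phi, Y, eta, sigma2):
    """PRESS (3.80) and GCV (3.82) scores from a single full-data ReLS-Q fit,
    using the illustrative parametrization P(eta) = eta * I_n."""
    N, n = Phi.shape
    M = Phi.T @ Phi + (sigma2 / eta) * np.eye(n)   # the matrix M in (3.81)
    H = Phi @ np.linalg.solve(M, Phi.T)            # hat matrix: H_kk = phi_k^T M^{-1} phi_k
    r = Y - H @ Y                                  # residuals r_k in (3.79b)
    press = np.sum((r / (1.0 - np.diag(H))) ** 2)
    gcv = np.sum(r ** 2) / (1.0 - np.trace(H) / N) ** 2
    return press, gcv
```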

3.5.3 Expected In-Sample Validation Error

In the definition of the validation error \(\text {EVE}_{\mathscr {D}_\text {T}}\) (3.72), reported for convenience also below

$$\begin{aligned} \text {EVE}_{\mathscr {D}_\text {T}}(\eta )={\mathscr {E}}[(y_{\text {v}}-\phi _{\text {v}}^T\hat{\theta }^{\text {R}}(\eta ))^2|\mathscr {D}_\text {T}], \end{aligned}$$

we assumed that the conditional expectation \({\mathscr {E}}\) is over the independent validation sample pair \(y_\text {v}\in {\mathbb R}\), \(\phi _{\text {v}}\in {\mathbb R}^n\), which are drawn randomly from their joint distribution. The computation of the validation error (3.72) could become easier if independent validation sample pairs \(y_\text {v}\in {\mathbb R}\), \(\phi _{\text {v}}\in {\mathbb R}^n\) are generated in a particular way.

For linear regression problems, it is convenient to assume that the same deterministic regressors \(\phi _i\), \(i=1,2,\ldots ,N\), are used for generating both the training data and the validation data. To be specific, still using \(\theta _0\) to denote the true parameter vector, we recall from (3.6), that the training output samples are

$$\begin{aligned} y_{i} = \phi _i^T\theta _0+ e_{i},\quad i=1,\ldots ,N. \end{aligned}$$
(3.83)

In this case, the training set  is

$$\begin{aligned} \mathscr {D}_\text {T}=\{(y_i,\phi _i) \ | \ y_i\in {\mathbb R},\phi _i\in {\mathbb R}^n \text { satisfying (3.83)},\ i=1,\ldots ,N\}. \end{aligned}$$
(3.84)

Using the same regressors \(\phi _i\), consider a set of validation output samples \(y_{\text {v},i}\) as follows:

$$\begin{aligned} y_{\text {v},i} = \phi _i^T\theta _0+ e_{\text {v},i},\quad i=1,\ldots ,N, \end{aligned}$$
(3.85)

where \(\theta _0\) is the true parameter vector, with the noises \(e_{i}\) and \(e_{\text {v},i}\) assumed independent and identically distributed. The validation error is now denoted by \({\text {EVE}_{\text {in}}}_{\mathscr {D}_\text {T}}(\eta )\) and computed as follows:

$$\begin{aligned}&{\text {EVE}_{\text {in}}}_{\mathscr {D}_\text {T}}(\eta )=\frac{1}{N}\sum _{i=1}^N{\mathscr {E}}[(y_{\text {v},i}-\phi ^T_i\hat{\theta }^{\text {R}}(\eta ))^2|\mathscr {D}_\text {T}], \end{aligned}$$
(3.86)

and called in-sample validation error [21, p. 228]. Note that, similarly to what was discussed after (3.72), the expectation \({\mathscr {E}}\) in (3.86) is computed w.r.t. the joint distribution of the couples \(y_{\text {v},i},\phi _i\) conditioned on the training data \(\mathscr {D}_\text {T}\). Thus, the result is a function of the training set. As done in (3.73), we can remove such dependence by computing the expected in-sample validation error as

$$\begin{aligned}&\text {EVE}_{\text {in}}(\eta )={\mathscr {E}}[{\text {EVE}_{\text {in}}}_{\mathscr {D}_\text {T}}(\eta )], \end{aligned}$$
(3.87)

with expectation taken over the joint distribution of the training data. In what follows, we will see how to build an unbiased estimator of \(\text {EVE}_{\text {in}}(\eta )\) using the training data (3.84), and how to exploit it for hyperparameters tuning.

3.5.3.1 Expectation of the Sum of Squared Residuals, Optimism and Degrees of Freedom

To estimate \(\text {EVE}_{\text {in}}(\eta )\), consider the sum of squared residuals

$$\begin{aligned} \overline{\text {err}}(\eta )_{\mathscr {D}_\text {T}}= \frac{1}{N}\sum _{i=1}^N (y_i-\phi ^T_i\hat{\theta }^{\text {R}}(\eta ))^2, \end{aligned}$$
(3.88)

which is a function only of the training set. Its expectation w.r.t. the training data (3.84) is

$$\begin{aligned} \overline{\text {err}}(\eta ) = {\mathscr {E}}\left( \frac{1}{N}\sum _{i=1}^N (y_i-\phi ^T_i\hat{\theta }^{\text {R}}(\eta ))^2\right) . \end{aligned}$$
(3.89)

One expects \(\text {EVE}_{\text {in}}(\eta )\) to be not smaller than \(\overline{\text {err}}(\eta ) \) because this latter quantity exploits the same data to fit the model and to assess the error. This intuition is indeed true as shown in the following theorem whose proof is in Sect. 3.8.4.

Theorem 3.7

Consider the linear regression model (3.13) with the training data (3.84), the validation data (3.85) and the ReLS-Q estimate (3.58). Then it holds that

$$\begin{aligned} \overline{\text {err}}(\eta ) \le \text {EVE}_{\text {in}}(\eta ).\end{aligned}$$
(3.90)

Theorem 3.7 shows that the expectation of the sum of squares of the residuals is an overly optimistic estimator of the expected in-sample validation error \(\text {EVE}_{\text {in}}(\eta )\). The difference between \(\text {EVE}_{\text {in}}(\eta )\) and \(\overline{\text {err}}(\eta )\) is called the optimism in statistics. In particular, one has, see, e.g., [21, p. 229]:

$$\begin{aligned} \text {EVE}_{\text {in}}(\eta )= \overline{\text {err}}(\eta ) + \text {optimism}(\eta ), \end{aligned}$$
(3.91)

where rewriting (3.83) as

$$\begin{aligned} Y=\varPhi \theta _0+E, \end{aligned}$$
(3.92)

and defining the output prediction as

$$ \hat{Y}(\eta ) = \varPhi \hat{\theta }^{\text {R}}(\eta ), $$

it holds that

$$\begin{aligned} \text {optimism}(\eta )&=2\frac{1}{N}{{\,\mathrm{trace}\,}}(\text {Cov}(Y,\hat{Y}(\eta )))\ge 0. \end{aligned}$$
(3.93)

Combining arguments contained in the proof of Theorem 3.7 reported in the appendix to this chapter, see, in particular, (3.164), with the definition of equivalent degrees of freedom in (3.65), one obtains that

$$\begin{aligned} {{\,\mathrm{trace}\,}}(\text {Cov}(Y,\hat{Y}(\eta )))=\sigma ^2\text {dof}(\hat{\theta }^{\text {R}}(\eta )). \end{aligned}$$
(3.94)

This thus reveals the deep connection between the optimism and the equivalent degrees of freedom.

3.5.3.2 An Unbiased Estimator of the Expected In-Sample Validation Error

Exploiting (3.94), we can now rewrite (3.91) as

$$\begin{aligned} \text {EVE}_{\text {in}}(\eta )= \overline{\text {err}}(\eta ) + 2\sigma ^2\frac{\text {dof}(\hat{\theta }^\text {R}(\eta ))}{N}. \end{aligned}$$
(3.95)

Interestingly, on the left-hand side of (3.95), \(\text {EVE}_{\text {in}}(\eta )\), by definition (3.87), is the mean of a random variable which depends on both the training data (3.84) and the validation data (3.85). Instead, on the right-hand side of (3.95), \(\overline{\text {err}}(\eta )\) is the expectation of a random variable which depends only on the training data. Hence, an unbiased estimator \(\widehat{\text {EVE}_{\text {in}}}(\eta )\) of \(\text {EVE}_{\text {in}}(\eta )\) is obtained just replacing \(\overline{\text {err}}(\eta )\) with \(\overline{\text {err}}(\eta )_{\mathscr {D}_\text {T}}\) reported in (3.88). One thus obtains

$$\begin{aligned} \nonumber \widehat{\text {EVE}_{\text {in}}}(\eta )&= \overline{\text {err}}(\eta )_{\mathscr {D}_\text {T}} + 2\sigma ^2\frac{\text {dof}(\hat{\theta }^\text {R}(\eta ))}{N}\\ {}&=\frac{1}{N}\Vert Y-\varPhi \hat{\theta }^{\text {R}}(\eta )\Vert _2^2+2\sigma ^2\frac{\text {dof}(\hat{\theta }^\text {R}(\eta ))}{N}. \end{aligned}$$
(3.96)

So, after observing the training data (3.84), the hyperparameter \(\eta \) can be estimated as follows:

$$\begin{aligned} \hat{\eta } = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma } \frac{1}{N} \Vert Y-\varPhi \hat{\theta }^{\text {R}}(\eta )\Vert _2^2+2\sigma ^2\frac{\text {dof}(\hat{\theta }^\text {R}(\eta ))}{N}. \end{aligned}$$
(3.97)

The hyperparameter estimation criterion (3.97) has different names in statistics: it is known as the \(C_p\) statistic, e.g., [27], and as Stein’s unbiased risk estimator (SURE), e.g., [40].
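
A sketch of the criterion (3.97) is given below, with the degrees of freedom computed as the trace of the hat matrix and the noise variance \(\sigma ^2\) assumed known or previously estimated; the parametrization \(P(\eta )=\eta I_n\) is again only for illustration.

```python
import numpy as np

def sure_score(Phi, Y, eta, sigma2):
    """Unbiased estimate (3.96) of the expected in-sample validation error,
    i.e., the C_p / SURE criterion minimized in (3.97), for P(eta) = eta * I_n."""
    N, n = Phi.shape
    M = Phi.T @ Phi + (sigma2 / eta) * np.eye(n)
    theta = np.linalg.solve(M, Phi.T @ Y)               # ReLS-Q estimate (3.58)
    dof = np.trace(Phi @ np.linalg.solve(M, Phi.T))     # dof = trace(H), see (3.65)
    return np.sum((Y - Phi @ theta) ** 2) / N + 2.0 * sigma2 * dof / N

# example of use: minimize the score over a grid of candidate eta values
# etas = np.logspace(-3, 2, 30)
# eta_hat = etas[int(np.argmin([sure_score(Phi, Y, e, sigma2) for e in etas]))]
```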

Interestingly, as it will be clear from the proof of Theorem 3.7, the above formula (3.97) still provides an unbiased prediction risk estimator even if we replace \(\varPhi \theta _0\) in (3.92) with a generic vector \(\mu \) s.t. \(Y=\mu +E\). Hence, one does not need to assume the existence of the true \(\theta _0\) and of a regression matrix which describes the linear input–output relation. A variant of the expected in-sample validation error is also discussed in Sect. 3.8.5.

3.5.3.3 Excess Degrees of Freedom*

In the previous subsection, we have discussed how to construct an unbiased estimator of the expected in-sample validation error, see (3.96), and how to use it for hyperparameters tuning, see (3.97). Irrespective of the particular method adopted for hyperparameter estimation, the estimate \(\hat{\eta }\) of \(\eta \) depends on the data Y, with the regression matrix \(\varPhi \) here assumed deterministic and known. We stress this by writing

$$\hat{\eta } = \hat{\eta }(Y).$$

Accordingly, the ReLS-Q estimate (3.58) with \(\eta \) replaced by \( \hat{\eta }(Y)\) becomes

$$\begin{aligned} \hat{\theta }^{\text {R}}(\hat{\eta }(Y))=(\varPhi ^T\varPhi +\sigma ^2 P^{-1}(\hat{\eta }(Y)))^{-1}\varPhi ^TY. \end{aligned}$$
(3.98)

Since \(\hat{\eta }\) is a random vector, to design a true unbiased estimator of the expected in-sample validation error of \(\hat{\theta }^{\text {R}}(\hat{\eta }(Y))\) one should not use (3.96) since it assumes the hyperparameter \(\eta \) constant.

In what follows, we will derive an unbiased estimator of the expected in-sample validation error of \(\hat{\theta }^{\text {R}}(\hat{\eta }(Y))\). Such an estimator will thus be able to account also for the price of estimating model complexity (the degrees of freedom) from data. To this goal, we need the following version of Stein’s Lemma [40], a simplified version of which was already introduced in Chap. 1.

Lemma 3.2

(Stein’s Lemma, adapted from [40]) Consider the following additive measurement model:

$$\begin{aligned} x=\mu +\varepsilon ,\qquad x,\mu ,\varepsilon \in {\mathbb R}^p, \end{aligned}$$

where \(\mu \) is an unknown constant vector and \(\varepsilon \sim N(0,\varSigma )\). Let \(\hat{\mu }(x)\) be an estimator of \(\mu \) based on the data x such that \(\text {Cov}(\hat{\mu }(x),x)\) and \({\mathscr {E}}(\frac{\partial \hat{\mu }(x)}{\partial x})\) exist. Then

$$\text {Cov}(\hat{\mu }(x),x)={\mathscr {E}}\left( \frac{\partial \hat{\mu }(x)}{\partial x}\right) \varSigma .$$

Let

$$\begin{aligned} Y_{\text {v}}=\left[ \begin{array}{c} y_{\text {v},1} \\ y_{\text {v},2} \\ \vdots \\ y_{\text {v},N} \\ \end{array} \right] , \ E_{\text {v}}=\left[ \begin{array}{c} e_{\text {v},1} \\ e_{\text {v},2} \\ \vdots \\ e_{\text {v},N} \\ \end{array} \right] , \end{aligned}$$
(3.99)

so that (3.85) can be rewritten as

$$\begin{aligned} Y_{\text {v}}=\varPhi \theta _0+E_{\text {v}}. \end{aligned}$$
(3.100)

Now, let us consider the measurement model (3.92) and the validation data (3.100), assuming also that

$$\begin{aligned} E\sim N(0,\sigma ^2 I_N), \ E_{\text {v}}\sim N(0,\sigma ^2 I_N). \end{aligned}$$
(3.101)

Then, using the correspondences

$$\begin{aligned}&x=Y,\ \mu =\varPhi \theta _0,\ \hat{\mu }(x) = \varPhi \hat{\theta }^{\text {R}}(\hat{\eta }(Y)),\ \tilde{x} = Y_{\text {v}},\ \varepsilon =E,\ \tilde{\varepsilon }=E_{\text {v}},\ \varSigma =\sigma ^2 I_N \\&f(Y,\hat{\eta })=\varPhi \hat{\theta }^{\text {R}}(\hat{\eta }(Y))=\varPhi (\varPhi ^T\varPhi +\sigma ^2 P^{-1}(\hat{\eta }(Y)))^{-1}\varPhi ^TY, \end{aligned}$$

together with (3.161) in the appendix to this chapter, one can prove that

$$\begin{aligned}&\underbrace{{\mathscr {E}}\left[ \frac{1}{N}{\mathscr {E}}[\Vert Y_{\text {v}}-\varPhi \hat{\theta }^{\text {R}}(\hat{\eta }(Y))\Vert _2^2|\mathscr {D}_\text {T}]\right] }_{\text {EVE}_{\text {in}}(\eta )}-\underbrace{{\mathscr {E}}\left[ \frac{1}{N} \Vert Y-\varPhi \hat{\theta }^{\text {R}}(\hat{\eta }(Y))\Vert _2^2\right] }_{\overline{\text {err}}(\eta )}\\&\qquad \qquad =2\frac{1}{N}{{\,\mathrm{trace}\,}}(\text {Cov}(Y,\varPhi \hat{\theta }^{\text {R}}(\hat{\eta }(Y)))).\end{aligned}$$

Using Stein’s Lemma, one has

$$\begin{aligned}\nonumber \text {Cov}(Y,\varPhi \hat{\theta }^{\text {R}}(\hat{\eta }(Y)))&=\sigma ^2{\mathscr {E}}[\frac{d f(Y,\hat{\eta })}{d Y}]\\&=\sigma ^2{\mathscr {E}}[\frac{\partial f(Y,\hat{\eta })}{\partial Y}]+\sigma ^2{\mathscr {E}}[\frac{\partial f(Y,\hat{\eta })}{\partial \hat{\eta }}\frac{\partial \hat{\eta }}{\partial Y}]\nonumber \\&=\sigma ^2{\mathscr {E}}[\varPhi (\varPhi ^T\varPhi +\sigma ^2 P^{-1}(\hat{\eta }(Y)))^{-1}\varPhi ^T]+\sigma ^2{\mathscr {E}}[\frac{\partial f(Y,\hat{\eta })}{\partial \hat{\eta }}\frac{\partial \hat{\eta }}{\partial Y}]. \end{aligned}$$

Therefore, it holds that

$$\begin{aligned} \text {EVE}_{\text {in}}&= \overline{\text {err}}(\eta ) + 2\sigma ^2\frac{1}{N}{\mathscr {E}}[{{\,\mathrm{trace}\,}}(\varPhi P(\hat{\eta }(Y))\varPhi ^T(\varPhi P(\hat{\eta }(Y))\varPhi ^T+\sigma ^2 I_N)^{-1})]\nonumber \\ {}&\qquad \qquad \qquad \qquad +2\sigma ^2\frac{1}{N}{{\,\mathrm{trace}\,}}({\mathscr {E}}[\frac{\partial f(Y,\hat{\eta })}{\partial \hat{\eta }}\frac{\partial \hat{\eta }}{\partial Y}])\nonumber \\&=\overline{\text {err}}(\eta ) + 2\sigma ^2\frac{\text {dof}(\hat{\theta }^{\text {R}}(\hat{\eta }(Y)))}{N} +2\sigma ^2\frac{1}{N}{{\,\mathrm{trace}\,}}({\mathscr {E}}[\frac{\partial f(Y,\hat{\eta })}{\partial \hat{\eta }}\frac{\partial \hat{\eta }}{\partial Y}]). \end{aligned}$$
(3.102)

If \(\hat{\eta }=\hat{\eta }(Y)\) were independent of Y, the above objective would coincide with the SURE score reported in (3.97). The difference is instead the presence of the term \(2\sigma ^2\frac{1}{N}{{\,\mathrm{trace}\,}}({\mathscr {E}}[\frac{\partial f(Y,\hat{\eta })}{\partial \hat{\eta }}\frac{\partial \hat{\eta }}{\partial Y}])\). It represents the extra optimism induced by the estimation of \(\eta \) and is due to the randomness of the data Y entering the hyperparameter estimator. The term \({{\,\mathrm{trace}\,}}({\mathscr {E}}[\frac{\partial f(Y,\hat{\eta })}{\partial \hat{\eta }}\frac{\partial \hat{\eta }}{\partial Y}])\) is called the excess degrees of freedom [33] and denoted by

$$\begin{aligned} \text {exdof}(\hat{\theta }^{\text {R}}(\hat{\eta }(Y)))={{\,\mathrm{trace}\,}}({\mathscr {E}}[\frac{\partial f(Y,\hat{\eta })}{\partial \hat{\eta }}\frac{\partial \hat{\eta }}{\partial Y}]). \end{aligned}$$
(3.103)

From (3.102), we readily obtain an unbiased estimator of \(\text {EVE}_{\text {in}}\) as follows:

$$\begin{aligned} \nonumber \widehat{\text {EVE}_{\text {in}}}&= \overline{\text {err}}(\eta )_{\mathscr {D}_\text {T}} + 2\sigma ^2\frac{\text {dof}(\hat{\theta }^{\text {R}}(\hat{\eta }(Y)))}{N}+2\sigma ^2\frac{\widehat{\text {exdof}(\hat{Y}(\hat{\eta }))}}{N} \\&=\frac{1}{N}\Vert Y-\varPhi \hat{\theta }^{\text {R}}(\hat{\eta }(Y))\Vert _2^2+2\sigma ^2\frac{\text {dof}(\hat{\theta }^{\text {R}}(\hat{\eta }(Y)))}{N} \nonumber \\&+2\sigma ^2\frac{1}{N}{{\,\mathrm{trace}\,}}(\frac{\partial f(Y,\hat{\eta })}{\partial \hat{\eta }}\frac{\partial \hat{\eta }}{\partial Y}), \end{aligned}$$
(3.104)

where \(\widehat{\text {exdof}(\hat{Y}(\hat{\eta }))}\) is an unbiased estimator of \(\text {exdof}(\hat{Y}(\hat{\eta }))\). As discussed in [33], (3.104) can be used to compare different regularized estimators also in terms of the different complexity of the hyperparameters tuning strategies that they adopt.

3.6 Regularized Least Squares with Other Types of Regularizers \(\star \)

The general ReLS criterion assumes the following form

$$\begin{aligned} \hat{\theta }^{\text {R}}&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert Y-\varPhi \theta \Vert _2^2 + \gamma J(\theta ). \end{aligned}$$

The different choices of the regularization term \(J(\theta )\) depend on the prior knowledge regarding \(\theta _{0}\). Having discussed the quadratic penalty, we will now consider two other important choices for \(J(\theta )\) given by the \(\ell _1\)- or nuclear norm.

3.6.1 \(\ell _1\)-Norm Regularization

ReLS with \(\ell _1\)-norm regularization  leads to

$$\begin{aligned} \hat{\theta }^{\text {R}}&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert Y-\varPhi \theta \Vert _2^2 + \gamma \Vert \theta \Vert _1, \end{aligned}$$
(3.105)

where \(\Vert \theta \Vert _1\) represents the \(\ell _1\)-norm of \(\theta \), i.e., \(\Vert \theta \Vert _1=\sum _{i=1}^n|\theta _i|\) with \(\theta _i\) being the ith element of \(\theta \). The problem (3.105) is also known as the least absolute shrinkage and selection operator (LASSO)   [42] and is equivalently defined as follows:

$$\begin{aligned} \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert Y-\varPhi \theta \Vert _2^2,\ {{\,\mathrm{subj. to\ }\,}}\Vert \theta \Vert _1\le \beta , \end{aligned}$$
(3.106)

where \(\beta \ge 0\) is a tuning parameter connected with \(\gamma \) that controls the sparsity of \(\theta \).

3.6.1.1 Computation of Sparse Solutions

LASSO (3.105) has been widely used for finding sparse solutions. In signal processing, this problem has wide application in compressive sensing for finding sparse signal representations from redundant dictionaries. In machine learning and statistics, it has also been applied extensively for variable selection, where the aim is to select a subset of relevant variables to use in model construction.

Recall that a vector \(\theta \in {\mathbb R}^n\) is said to be sparse if \(\Vert \theta \Vert _0 \ll n\), where \(\Vert \theta \Vert _0\) is the \(\ell _0\) norm of \(\theta \) which counts the number of nonzero elements of \(\theta \). For linear regression models, sparse estimation requires finding a sparse \(\theta \) able to fit the data well, i.e., such that \(\Vert Y-\varPhi \theta \Vert _2^2\) is small. More formally, the problem is defined as follows:

$$\begin{aligned} \min _\theta \Vert \theta \Vert _0,\ {{\,\mathrm{subj. to\ }\,}}\Vert Y-\varPhi \theta \Vert _2^2\le \varepsilon , \end{aligned}$$
(3.107)

where \(Y\in {\mathbb R}^N,\theta \in {\mathbb R}^{n}\) with \(n>N\), \(\varPhi \in {\mathbb R}^{N\times n}\) assumed of full rank, i.e., \({{\,\mathrm{rank}\,}}(\varPhi )=N\), and \(\varepsilon \ge 0\) is a tuning parameter that controls the data fit.

The problem (3.107) is known to be NP-hard, e.g., [31]. It is combinatorial and finding its solution requires an exhaustive search. Hence, one needs approximate methods. The most popular technique relies on a convex relaxation of (3.107) obtained by replacing the \(\ell _0\)-norm with the \(\ell _1\)-norm:

$$\begin{aligned} \min _\theta \Vert \theta \Vert _1, \ {{\,\mathrm{subj. to\ }\,}}\Vert Y-\varPhi \theta \Vert _2^2\le \varepsilon . \end{aligned}$$
(3.108)

By using the method of Lagrange multipliers, it can be shown that the convex relaxation (3.108) is equivalent to LASSO (3.105).

A natural question is whether or not the solution of LASSO (3.105) can be sparse. The answer is affirmative. For illustration, we first show this feature when the regression matrix \(\varPhi \) is orthogonal and assuming \(N=n\).

3.6.1.2 LASSO Using an Orthogonal Regression Matrix

Let us consider (3.105) with orthogonal regression matrix \(\varPhi \), i.e., \(\varPhi ^T\varPhi =\varPhi \varPhi ^T=I_n\). Then (3.105) is rearranged as follows:

$$\begin{aligned} \hat{\theta }^\text {R}&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert (\varPhi ^T\varPhi )^{-1} \varPhi ^T(Y-\varPhi \theta )\Vert _2^2 + \gamma \Vert \theta \Vert _1 \nonumber \\&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert \hat{\theta }^{\text {LS}}-\theta \Vert _2^2 + \gamma \Vert \theta \Vert _1 \nonumber \\&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \sum _{i=1}^n \left[ (\hat{\theta }_i^{\text {LS}}-\theta _i)^2 + \gamma |\theta _i|\right] , \end{aligned}$$
(3.109)

where \(\hat{\theta }_i^{\text {LS}}\) is the ith element of \(\hat{\theta }^{\text {LS}}\).

To derive the optimal solution \(\hat{\theta }^\text {R}\), we first recall the definition of subderivative  and subdifferential of a convex function \(f:X\rightarrow {\mathbb R}\) with X being an open interval. The subderivative of a convex function \(f:X\rightarrow {\mathbb R}\) at a point \(x_0\) in the open interval X is a real number a such that

$$f(x)-f(x_0)\ge a (x-x_0)$$

for all x in X. It can be shown that there exist b and c with \(b\le c\) such that the set of subderivatives at \(x_0\) for a convex function is a nonempty closed interval \([b,\ c]\), where b and c are the one-sided limits defined as follows:

$$b=\lim _{x\rightarrow x_0^-} \frac{f(x)-f(x_0)}{x-x_0},\quad c=\lim _{x\rightarrow x_0^+} \frac{f(x)-f(x_0)}{x-x_0}.$$

The closed interval \([b,\ c]\) is called the subdifferential of f(x) at the point \(x_0\).

Then, considering (3.109), \(\hat{\theta }^\text {R}\) is an optimal solution if

$$\begin{aligned} -2(\hat{\theta }_i^{\text {LS}}-\hat{\theta }^\text {R}_i) + \gamma \partial |\hat{\theta }^\text {R}_i|=0, \ i=1,2,\ldots ,n, \end{aligned}$$
(3.110)

where \(\hat{\theta }_i^{\text {R}}\) is the ith element of \(\hat{\theta }^{\text {R}}\) and \(\partial |\hat{\theta }^\text {R}_i|\) represents the subdifferential of \(|\hat{\theta }^\text {R}_i|\) which is equal to

$$\begin{aligned} \partial |\hat{\theta }^\text {R}_i|=\left\{ \begin{array}{cc} \{\text {sign}(\hat{\theta }^\text {R}_i)\} &{} \hat{\theta }^\text {R}_i\ne 0 \\ {[-1,\ 1]} &{} \hat{\theta }^\text {R}_i=0 \end{array} \right. , \ i=1,2,\ldots ,n. \end{aligned}$$
(3.111)

Using (3.110) and (3.111), we obtain the following explicit solution of LASSO for orthogonal \(\varPhi \):

$$\begin{aligned} \hat{\theta }^\text {R}_i = \text {sign}(\hat{\theta }^{\text {LS}}_i)\max \left\{ 0,|\hat{\theta }^{\text {LS}}_i|-\frac{\gamma }{2}\right\} , \ i=1,2,\ldots ,n. \end{aligned}$$
(3.112)

From (3.112), the well-known soft-thresholding formula, one can see that the solution of LASSO will be sparse if many absolute values of the elements of \(\hat{\theta }^{\text {LS}}\) are smaller than \(\gamma /2\). So, \(\gamma \) can be used to tune the sparsity of \(\theta \). It can also be seen that the nonzero elements of the solution of LASSO are biased and that, compared with the LS solution, they are shrunk towards zero (translated towards zero by the constant amount \(\gamma /2\)).
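
The closed form (3.112) can be coded in a couple of lines, as sketched below; the orthogonal regression matrix and the true parameter vector are generated only to make the snippet self-contained.

```python
import numpy as np

def lasso_orthogonal(Phi, Y, gamma):
    """LASSO solution (3.112) for orthogonal Phi: componentwise soft thresholding."""
    theta_ls = Phi.T @ Y                                # LS estimate, since Phi^T Phi = I_n
    return np.sign(theta_ls) * np.maximum(np.abs(theta_ls) - gamma / 2.0, 0.0)

# illustration with a random orthogonal N = n regression matrix (assumed setup)
rng = np.random.default_rng(0)
n = 8
Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))      # Phi^T Phi = I_n
theta0 = np.array([2.0, -1.5, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0])
Y = Phi @ theta0 + 0.05 * rng.standard_normal(n)
theta_hat = lasso_orthogonal(Phi, Y, gamma=0.4)         # small components are set exactly to zero
```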

3.6.1.3 LASSO Using a Generic Regression Matrix: Geometric Interpretation

For a generic non-orthogonal \(\varPhi \), LASSO in general has no explicit solutions. To understand why it can still induce sparse solutions, we can use the geometric interpretation of LASSO in the form of (3.106) with \(\theta \in {\mathbb R}^2\). In Fig. 3.12, one can see that for the first case coloured in blue (resp., the third case coloured in brown), if the elliptical contour is rotated slightly about the axis perpendicular to the paper and through the blue (resp., brown) cross, the optimal solution of (3.106) will still have a zero \(\theta _1\)-element (resp., \(\theta _2\)-element). This explains why LASSO can often induce sparse solutions with a suitable choice of the regularization parameter.

Fig. 3.12

Geometric interpretation of the solution of LASSO in the form (3.106) with non-orthogonal \(\varPhi \) and \(\theta =[\theta _1\ \theta _2]^T\in {\mathbb R}^2\). First, the large grey square represents the constraint \(\Vert \theta \Vert _1\le \beta \). Then, three cases are considered here and coloured in blue, red and brown, respectively. For each case, the tiny square represents the least squares estimate \(\hat{\theta }^{\text {LS}}\), the elliptical contours represent the level curves of \(\Vert Y-\varPhi \theta \Vert _2^2\) centred at \(\hat{\theta }^{\text {LS}}\) and the cross represents the solution of LASSO (3.106). For the first case coloured in blue, the cross happens at the top corner of the large grey square and implies that the \(\theta _1\)-element of the solution of LASSO (3.106) is zero. For the second case coloured in red, the cross and the tiny square coincide and imply that the least square estimate \(\hat{\theta }^{\text {LS}}\) is also the solution of LASSO (3.106) whose two components are both nonzero. For the third case coloured in brown, the cross happens at the right corner of the large grey square and implies that the \(\theta _2\)-element of the solution of LASSO (3.106) is zero

Finally, since the cost function of LASSO (3.105) is a convex function of \(\theta \), many standard convex optimization software packages are available to obtain numerical solutions of LASSO very efficiently, such as YALMIP [26], CVX [19], CVXOPT [3], CVXPY [11].
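
For instance, using CVXPY, a generic LASSO problem (3.105) can be posed and solved in a few lines; the data generated below are placeholders used only to make the snippet self-contained.

```python
import cvxpy as cp
import numpy as np

# placeholder data: any Y, Phi, gamma of compatible sizes can be used instead
rng = np.random.default_rng(0)
N, n, gamma = 40, 16, 0.15
Phi = rng.standard_normal((N, n))
theta_true = np.concatenate([np.array([1.0, -2.0, 0.5]), np.zeros(n - 3)])
Y = Phi @ theta_true + 0.1 * rng.standard_normal(N)

theta = cp.Variable(n)
objective = cp.Minimize(cp.sum_squares(Y - Phi @ theta) + gamma * cp.norm1(theta))
cp.Problem(objective).solve()
theta_hat = theta.value            # many entries are (numerically) zero
```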

Example 3.7

(Polynomial regression-LASSO) We revisit the polynomial regression Examples (3.26) and (3.27) with LASSO (3.105). In particular, we set the model order to \(n=16\), with the regression matrix \(\varPhi \) built according to (3.4) and (3.12). Moreover, we let \(\gamma =\gamma _i\), \(i=1,\ldots ,16\) with \(\gamma _1=0.01\), \(\gamma _{16}=0.31\) and \(\gamma _2,\ldots ,\gamma _{15}\) evenly spaced between \(\gamma _1\) and \(\gamma _{16}\). For each \(\gamma =\gamma _i\), we compute the corresponding solution of the LASSO (3.105). In particular, the estimates \(\hat{g}(x) = \phi (x)^T\hat{\theta }^{\text {R}}\) for \(x=x_i\), with \(i=1,\ldots ,40\), are plotted in Fig. 3.13.

Fig. 3.13

Polynomial regression: true function g(x) (blue) and LASSO estimates (thin) for different values of the regularization parameter \(\gamma \)

The model fits (3.28) obtained for different \(\gamma \) are shown in Fig. 3.14. One can see that \(\gamma =0.15\) gives the best result.

Fig. 3.14

Polynomial regression: profile of the model fit (3.28) obtained by LASSO as a function of the regularization parameter \(\gamma \)

Finally, the LASSO estimates of the components of \(\theta \) obtained using the different values of \(\gamma \) are shown in Fig. 3.15. It is evident that the LASSO estimate (3.105) is sparse. Comparing it with the ridge regression estimates reported in Fig. 3.11, one can conclude that LASSO may give a simpler model, i.e., depending only on a limited number of components of \(\theta \).    \(\square \)

Fig. 3.15

Polynomial regression: profile of the estimates of each component forming the LASSO estimate (3.105). For each value \(k=0,\ldots ,15\) on the x-axis the plot reports the estimates of the coefficient of the monomial \(x^k\) obtained by using different values of the regularization parameter \(\gamma \)

3.6.1.4 Sparsity Inducing Regularizers Beyond the \(\ell _1\)-Norm

We have seen that the \(\ell _1\)-norm plays a key role for sparse estimation. However, as shown in [34], there are many other sparsity inducing regularizers. Let l be any concave and nondecreasing function on \([0,\ \infty )\), three examples being reported in the top panel of Fig. 3.16. Then, other penalties which promote sparsity assume the form \(J(\theta ) = \sum _{i=1}^n l(\theta _i^2)\) and are given by

$$\begin{aligned} \begin{aligned} l(\eta )&= \eta ^{\frac{p}{2}}, \ p\in (0,2)&\begin{array}{c}\eta =\theta _i^2 \\ \Longrightarrow \end{array} \qquad&J(\theta ) = \sum _{i=1}^n |\theta _i|^p, \ p\in (0,2),\\ l(\eta )&= \log (|\eta |^{\frac{1}{2}}+\varepsilon ), \ \varepsilon >0&\begin{array}{c}\eta =\theta _i^2 \\ \Longrightarrow \end{array} \qquad&J(\theta ) = \sum _{i=1}^n \log (|\theta _i|+\varepsilon ). \end{aligned} \end{aligned}$$
(3.113)

Some of them are displayed in the bottom panel of Fig. 3.16. The use of nonconvex penalties may increase the sparsity in the solution but the drawback is that optimization problems possibly exposed to local minima must be handled.

Fig. 3.16

The top panel shows profiles of \(l(\theta _i)\) given by \(\theta _i^{0.05}\), \(\log (\theta _i^{0.5}+1)\) and \(\theta _i^{0.5}\) with \(\theta _i\) ranging over \([0,\ 1]\). The bottom panel displays profiles of sparsity inducing penalties \(l(\theta ^2_i)\) given by \(|\theta _i|^{0.1}\), \(\log (|\theta _i|+1)\) and \(|\theta _i|\) with \(\theta _i\) ranging over \([-1,\ 1]\)

3.6.1.5 Presence of Outliers and Robust Regression

In practical applications, it may happen that the measurement outputs \(y_i\) so far described by the model

$$\begin{aligned} y_i = \phi _i^T\theta _0+e_i,\quad i=1,\ldots ,N \end{aligned}$$

are contaminated by outliers, i.e., unexpected deviations from the noise model. They can be due to the failure of some sensors or to mistakes in the setting of the experiment. In this case, the data can actually be generated by the following system:

$$\begin{aligned} y_i = \phi _i^T\theta _0+e_i+v_{0,i},\quad i=1,\ldots ,N, \end{aligned}$$
(3.114)

where the \(e_i\) form a white noise with mean zero and variance \(\sigma ^2\), while the \(v_{0,i}\) represent the outliers, which are assumed to be zero most of the time. Hence, the vector

$$V_0=\left[ \begin{array}{cccc} v_{0,1} &{} v_{0,2} &{} \ldots &{} v_{0,N} \\ \end{array} \right] ^T $$

is assumed to be sparse.

When data come from (3.114), straightforward application of the LS method may lead to a poor estimate \(\hat{\theta }^{\text {LS}}\) of \(\theta _0\). For illustration, let us consider an extreme case by assuming \(v_{0,i}=0\) for \(i=1,2,\ldots ,N-1\) while the \(|\phi ^T_i\theta _0+e_i|\) for \(i=1,\ldots ,N\) are all negligible compared to \(|v_{0,N}|\). LS leads to

$$\begin{aligned} \hat{\theta }^{\text {LS}}&= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \sum _{i=1}^N (y_i-\phi ^T_i\theta )^2 \\&= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \sum _{i=1}^{N-1} (\phi ^T_i\theta _0+e_i-\phi ^T_i\theta )^2\\ {}&\qquad \qquad \qquad + (\phi ^T_N\theta _0+e_N+v_{0,N}-\phi ^T_N\theta )^2. \end{aligned}$$

The first \(N-1\) terms in the above cost function are the same as those encountered in the absence of outliers, while the last term is different due to \(v_{0,N}\). The \(|\phi ^T_i\theta _0+e_i|\), \(i=1,\ldots ,N\), are negligible compared to \(|v_{0,N}|\), a disproportion further amplified by the quadratic criterion adopted here. To make the last term as small as possible, \(\hat{\theta }^{\text {LS}}\) will mainly tend to fit \(v_{0,N}\) alone. Hence, the terms \(\phi ^T_i\theta _0+e_i\), which carry the information on the true system, will be largely ignored. This will lead to a poor estimate of \(\theta _0\).

Many robust regression methods are available hinging on loss functions less sensitive to outliers than the square loss. An example is  Huber estimation

$$\begin{aligned} \hat{\theta }^{\text {Huber}} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \sum _{i=1}^N l^{\text {Huber}}(y_i-\phi ^T_i\theta ) \end{aligned}$$
(3.115)

where the Huber loss function \(l^{\text {Huber}}\) is defined as follows:

$$\begin{aligned} l^{\text {Huber}}(x) = \left\{ \begin{array}{cc} x^2 &{} |x|<\frac{\gamma }{2} \\ \gamma |x|- \frac{1}{4}\gamma ^2 &{} |x|\ge \frac{\gamma }{2} \end{array}\right. . \end{aligned}$$
(3.116)

In (3.116), the parameter \(\gamma >0\) is a tuning parameter whose role will become clear shortly. The Huber loss function (3.116) is less sensitive to outliers because it grows linearly for \(|x|\ge \gamma /2\). Note that, as \(\gamma \) tends to zero, the quadratic region shrinks and the Huber loss behaves like a scaled version of the \(\ell _1\)-norm.

3.6.1.6 An Equivalence Between \(\ell _1\)-Norm Regularization and Huber Estimation

Let

$$\begin{aligned}&\tilde{y}_i=y_i-\phi ^T_i\theta , \ \ i=1,2,\ldots ,N,\\&\tilde{Y} = \left[ \begin{array}{cccc} \tilde{y}_1 &{} \tilde{y}_2 &{} \ldots &{} \tilde{y}_N \\ \end{array} \right] ^T. \end{aligned}$$

Consider the \(\ell _1\)-norm regularization given by

$$\begin{aligned} \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\theta ,V_0}&\sum _{i=1}^N \left[ (y_i-\phi ^T_i\theta - v_{0,i})^2 + \gamma |v_{0,i}|\right] \end{aligned}$$
(3.117)

whose peculiarity is to require joint optimization w.r.t. the parameter vector \(\theta \) and the outliers \(v_{0,i}\) contained in \(V_0\). Interestingly, (3.117) is actually equivalent to Huber estimation (3.115), i.e., they have the same optimal solution. To show this, one needs just to prove that

$$\begin{aligned} \sum _{i=1}^N l^{\text {Huber}}(y_i-\phi ^T_i\theta )=\min _{V_0} \Vert \tilde{Y}-V_0\Vert _2^2 + \gamma \Vert V_0\Vert _1. \end{aligned}$$
(3.118)

The right-hand side of (3.118) corresponds to LASSO (3.105) with an orthogonal regression matrix given by the identity. It thus follows from (3.112) that the components of the optimal solution \(\hat{V}_0^{\text {R}}\) admit the following closed-form expression:

$$\begin{aligned} \hat{v}^{\text {R}}_{0,i}=\text {sign}(\tilde{y}_i)\max \left\{ 0,|\tilde{y}_i|-\frac{\gamma }{2}\right\} , \ i=1,2,\ldots ,N. \end{aligned}$$
(3.119)

Now we replace \(V_0\) in the cost function on the right-hand side of (3.118) with \(\hat{V}_0^{\text {R}}\), and it is straightforward to check that the following identity holds:

$$\begin{aligned} \sum _{i=1}^N l^{\text {Huber}}(y_i-\phi ^T_i\theta )=\Vert \tilde{Y}-\hat{V}_0^{\text {R}}\Vert _2^2 + \gamma \Vert \hat{V}_0^{\text {R}}\Vert _1. \end{aligned}$$
(3.120)

Therefore, (3.117) is indeed equivalent to the Huber estimation (3.115).
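
The equivalence can also be checked numerically. CVXPY provides a built-in atom huber(x, M), equal to \(x^2\) for \(|x|\le M\) and \(2M|x|-M^2\) otherwise, so choosing \(M=\gamma /2\) recovers (3.116). The sketch below compares the solution of (3.115) with that of the joint problem (3.117) on synthetic data containing a few injected outliers; all numerical values are illustrative.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, n, gamma = 100, 4, 1.0
Phi = rng.standard_normal((N, n))
theta0 = np.array([1.0, -0.5, 2.0, 0.3])
Y = Phi @ theta0 + 0.1 * rng.standard_normal(N)
Y[::20] += 8.0                                   # sparse outliers v_{0,i}

# Huber estimation (3.115): cp.huber(r, M) with M = gamma/2 matches (3.116)
th_h = cp.Variable(n)
cp.Problem(cp.Minimize(cp.sum(cp.huber(Y - Phi @ th_h, gamma / 2)))).solve()

# joint (theta, V0) problem (3.117)
th_l, V0 = cp.Variable(n), cp.Variable(N)
cp.Problem(cp.Minimize(cp.sum_squares(Y - Phi @ th_l - V0)
                       + gamma * cp.norm1(V0))).solve()

# the two estimates of theta should agree up to solver accuracy
print(np.allclose(th_h.value, th_l.value, atol=1e-3))
```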

3.6.2 Nuclear Norm Regularization

So far the output Y, the parameter \(\theta \) and the noise E in (3.13) have been assumed to be vectors. In what follows, we allow them to be matrices and consider the following linear regression model:

$$\begin{aligned} Y&= \varPhi \theta _0+E, \ \ Y\in {\mathbb R}^{N\times m}, \ \ \varPhi \in {\mathbb R}^{N\times n}, \ \ \theta _0\in {\mathbb R}^{n\times m}, \ \ E\in {\mathbb R}^{N\times m}. \end{aligned}$$
(3.121)

The ReLS with nuclear norm regularization takes the following form:

$$\begin{aligned} \hat{\theta }^{\text {R}}&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert Y-\varPhi \theta \Vert _F^2 + \gamma \Vert h(\theta )\Vert _*, \end{aligned}$$
(3.122)

where \(\Vert \cdot \Vert _F\) is the Frobenius norm of a matrix, \(h(\theta )\) is a matrix that is affine in \(\theta \) and \(\Vert h(\theta )\Vert _*\) is the nuclear norm of the matrix \(h(\theta )\), see also Sect. 3.8.1, the appendix to this chapter, for a brief review of matrix and vector norms.

3.6.2.1 Nuclear Norm Regularization for Matrix Rank Minimization

Matrix rank minimization problems (RMP) are a class of optimization problems that involve minimizing the rank of a matrix subject to convex constraints. They are often encountered in signal processing, image processing and statistics. For example, a typical statistical problem is to obtain a low-rank covariance matrix able to describe some available data and/or consistent with some prior assumptions. Formally, the RMP is defined as follows:

$$\begin{aligned} \text {RMP: } \begin{aligned}&\min _X {{\,\mathrm{rank}\,}}(X)\\&{{\,\mathrm{subj. to\ }\,}}X\in {\mathfrak C}\subset {\mathbb R}^{n\times m}, \end{aligned} \end{aligned}$$
(3.123)

with X belonging to a convex set \(\mathfrak C\) while \({{\,\mathrm{rank}\,}}(X)\) describes the order (complexity) of the underlying model.

In general, the RMP (3.123) is NP-hard and thus there is a need for approximate methods. Several heuristic methods have been proposed, such as the nuclear norm heuristic [14] and the log-det heuristic [15]. In particular, for a convex set \(\mathfrak {C}\), the convex envelope of a function \(f:\mathfrak {C}\rightarrow {\mathbb R}\) is defined as the largest convex function g such that \(g(x)\le f(x)\) for every \(x\in \mathfrak C\), e.g., [22]. For a nonconvex f, solving

$$\begin{aligned} \min _{x\in \mathfrak C} f(x) \end{aligned}$$
(3.124)

may be difficult. In this case, if it is possible to derive the convex envelope g of f, then

$$\begin{aligned} \min _{x\in \mathfrak {C} } g(x)\end{aligned}$$
(3.125)

turns out to be a convex approximation of (3.124) and, in particular, the minimum of (3.125) provides a lower bound on that of (3.124). Moreover, if necessary, the minimizing argument of (3.125) can be chosen as the initial point for a more complicated nonconvex local search aiming to solve (3.124).

As shown in Theorem 1 of [13, Chap. 5], the convex envelope of the rank function \({{\,\mathrm{rank}\,}}(X)\) with \(X\in \mathfrak {C}=\{X |\Vert X\Vert _2\le 1, X\in {\mathbb R}^{n\times m}\}\) is the nuclear norm of X, i.e., \(\Vert X\Vert _*\). As a result, the nuclear norm heuristic to solve the RMP (3.123) is obtained by replacing the rank of X with the nuclear norm of X, i.e.,

$$\begin{aligned} \text {Nuclear norm heuristic: } \begin{aligned}&\min _X \Vert X\Vert _*\\&{{\,\mathrm{subj. to\ }\,}}X\in {\mathfrak C}\subset {\mathbb R}^{n\times m}. \end{aligned} \end{aligned}$$
(3.126)

Without loss of generality, we assume that \(X\in \mathfrak {C}=\{X \ | \ \Vert X\Vert _2\le M, X\in {\mathbb R}^{n\times m}\}\) for some \(M>0\). Then, from the definition of the convex envelope, for \(X\in \mathfrak {C}\) we have

$$\begin{aligned} \left\| \frac{X}{M}\right\| _*\le {{\,\mathrm{rank}\,}}\left( \frac{X}{M}\right) \quad \Longrightarrow \quad \frac{1}{M}\Vert X\Vert _*\le {{\,\mathrm{rank}\,}}(X). \end{aligned}$$

In addition

$$\begin{aligned} \frac{1}{M}\Vert X^{\text {copt}}\Vert _*\le {{\,\mathrm{rank}\,}}(X^{\text {opt}})\le {{\,\mathrm{rank}\,}}(X^{\text {copt}}), \end{aligned}$$
(3.127)

where \(X^{\text {opt}}\) and \(X^{\text {copt}}\) denote the optimal solution of the RMP (3.123) and that of the nuclear norm heuristic (3.126), respectively. The inequalities in (3.127) thus provide an upper and a lower bound on the optimal value of the RMP (3.123).

As shown in [13, Chap. 5], the nuclear norm heuristic (3.126) can be equivalently formulated as a semidefinite program (SDP):

$$\begin{aligned} \begin{aligned}&\min _{X,Y,Z} {{\,\mathrm{trace}\,}}{Y} + {{\,\mathrm{trace}\,}}{Z}\\&{{\,\mathrm{subj. to\ }\,}}\left[ \begin{array}{cc} Y &{} X \\ X^T &{} Z \\ \end{array} \right] \ge 0, \ X\in \mathfrak {C}, \end{aligned}\end{aligned}$$
(3.128)

where \(Y\in {\mathbb R}^{n\times n}, Z\in {\mathbb R}^{m\times m}\) and both Y and Z are symmetric. The SDP problem (3.128) can be solved by interior point methods. For this purpose, some convex optimization software packages which can be used include  YALMIP [26], CVX [19], CVXOPT [3] and CVXPY [11].

3.6.2.2 Application in Covariance Matrix Estimation with Low-Rank Structure

Now we go back to the linear regression model (3.121) and the ReLS with nuclear norm regularization (3.122). Consider the problem of covariance matrix estimation with low-rank structure, e.g., [38]. In particular, in (3.121), we take \(N=m=n\), let Y be a sample covariance matrix, \(\varPhi =I_n\), and \(\theta _0\) be a positive semidefinite matrix which has low-rank structure. Moreover, in (3.122), we take \(h(\theta )=\theta \). We can then obtain a matrix estimate \(\hat{\theta }^{\text {R}}\) with low-rank structure using ReLS with nuclear norm regularization as follows:

$$\begin{aligned} \hat{\theta }^{\text {R}}&=\displaystyle \mathop {{\text {arg}}\,{\text {min}}}_\theta \Vert Y-\theta \Vert _F^2 + \gamma \Vert \theta \Vert _*, \end{aligned}$$
(3.129)

for a suitable choice of \(\gamma >0\). An example is reported below.

Example 3.8

(Covariance matrix estimation problem) First, we construct a block-diagonal rank-deficient covariance matrix \(\theta _0\) that has 4 blocks denoted by \(A_i\in {\mathbb R}^{n_i\times n_i}\) with \(n_1=20\), \(n_2=10\), \(n_3=5\) and \(n_4=15\). Using \(blkdiag \) to represent a block-diagonal matrix, one thus has \(\theta _0=blkdiag (A_1,A_2,A_3,A_4)\). Each \(A_i\) is generated by summing up \(v_{i,j}v_{i,j}^T\), \(j=1,\ldots ,n_i-2\), where the \(v_{i,j}\) are \(n_i\)-dimensional vectors with components independent and uniformly distributed on \([-1,\ 1]\). It follows that \(rank (\theta _0)=42\) since the rank of each ith block is \(n_i-2\). Then we draw 20000 samples \(x_i\) from the Gaussian distribution \(\mathscr {N}(0,\theta _0)\). The available measurements are \(z_i=x_i+e_i\), where the \(e_i\) are independent and distributed as \(\mathscr {N}(0,0.6)\). Using the \(z_i\), we calculate the sample covariance Y as follows:

$$\begin{aligned} Y=\frac{1}{20000}\sum _{i=1}^{20000} (z_i-\bar{z})(z_i-\bar{z})^T, \quad \bar{z}= \frac{1}{20000} \sum _{i=1}^{20000} z_i. \end{aligned}$$
(3.130)

We solve the ReLS problem (3.129) with the data Y defined above and \(\gamma \) in the set \(\{0.1411,0.1414,0.1419,0.1423,0.1427\}\), obtaining different estimates \(\hat{\theta }^{\text {R}}\) of the covariance matrix.

The top panel of Fig. 3.17 shows the base 10 logarithm of the 50 estimated singular values. Each profile is obtained with a different regularization parameter. Such results show that, treating the tiny singular values as null, a suitable value of the regularization parameter, like \(\gamma =0.1427\), leads to \(rank (\hat{\theta }^{\text {R}})=42\). Note in fact that the green curve, which is associated with such \(\gamma \), has a jump towards zero when passing from 42 to 43 on the x-axis. The influence of the nuclear norm regularization is also visible in the bottom panel which shows the profile of the relative error of \(\hat{\theta }^{\text {R}}\) as a function of \(\gamma \). When \(\gamma \) is small, e.g., \(\gamma =0.1411\), the influence is negligible, \(\hat{\theta }^{\text {R}}\) is almost the same as the sample covariance Y and \(rank (\hat{\theta }^{\text {R}})=50\). When \(\gamma \) becomes larger, the regularization influence becomes more visible, making \(\hat{\theta }^{\text {R}}\) closer to the true covariance \(\theta _0\).    \(\square \)

Fig. 3.17

Covariance estimation with low-rank structure. Panel a shows the base 10 logarithm of the 50 singular values of the estimated covariance matrix \(\hat{\theta }^{\text {R}}\) with different values of \(\gamma \). Panel b shows the profile of the relative error of the estimated covariance matrix \(\hat{\theta }^{\text {R}}\) as a function of \(\gamma \)
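
A CVXPY sketch of the estimator (3.129) is reported below. It uses a smaller synthetic low-rank covariance than the one of Example 3.8, and the dimensions, noise level and value of \(\gamma \) are illustrative choices only.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, rank0, n_samples, gamma = 12, 4, 5000, 0.05
V = rng.uniform(-1, 1, size=(n, rank0))
theta0 = V @ V.T                                   # low-rank "true" covariance
X = rng.standard_normal((n_samples, rank0)) @ V.T  # samples with covariance theta0
Z = X + np.sqrt(0.1) * rng.standard_normal((n_samples, n))   # noisy measurements
Y = np.cov(Z, rowvar=False)                        # sample covariance, cf. (3.130)

theta = cp.Variable((n, n), symmetric=True)
objective = cp.Minimize(cp.sum_squares(Y - theta) + gamma * cp.normNuc(theta))
cp.Problem(objective).solve()
sv = np.linalg.svd(theta.value, compute_uv=False)  # many singular values are shrunk towards zero
```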

3.6.2.3 Vector Case: \(\ell _1\)-Norm Regularization

The nuclear norm heuristic and inequalities (3.127) also justify the use of the \(\ell _1\)-norm regularization (3.108) for the problem of finding sparse solutions (3.107).

For the vector case, i.e., \(\theta \in {\mathbb R}^{n\times m}\) with \(m=1\), we can take X and \(\mathfrak {C}\) in the previous section to be \(X=\theta \) and \(\mathfrak {C}=\{\theta \in {\mathbb R}^n|\Vert Y-\varPhi \theta \Vert _2^2\le \varepsilon \}\). Then it is easy to see that the \(\ell _1\)-norm is the convex envelope of the \(\ell _0\)-norm for \(\Vert \theta \Vert _\infty \le 1\), i.e.,

$$\Vert \theta \Vert _1\le \Vert \theta \Vert _0, \text { for } \Vert \theta \Vert _\infty \le 1.$$

Then, the RMP (3.123) and the nuclear norm heuristic (3.126) become the problem of finding sparse solutions (3.107) and the \(\ell _1\)-norm regularization (3.108), respectively. Similarly to what was done to obtain (3.127), assume that \(\Vert \theta \Vert _\infty \le M\) for some \(M>0\). Then one has

$$\begin{aligned} \frac{1}{M}\Vert \theta ^{\text {copt}}\Vert _1\le \Vert \theta ^{\text {opt}}\Vert _0\le \Vert \theta ^{\text {copt}}\Vert _0, \end{aligned}$$
(3.131)

where \(\theta ^{\text {opt}}\) and \(\theta ^{\text {copt}}\) denote the optimal solution of the problem of finding sparse solutions (3.107) and that of the \(\ell _1\)-norm regularization (3.108), respectively. Similarly to the matrix case, (3.131) provides an upper and a lower bound on the optimal value of the sparse estimation problem (3.107).

3.7 Further Topics and Advanced Reading

The systematic treatment of regression theory is available in many textbooks, e.g., [12, 35]. The noise variance estimation is a critical issue in practical applications and has been discussed in detail in [48]. When the regression matrix is ill-conditioned, it is important to make sure that the least squares estimate is calculated in an accurate and efficient way, e.g., [10, 17]. Moreover, for the regularized least squares in quadratic form, the regularization matrix could also be ill-conditioned. In this case, extra care is required in the calculation of both the regularized least squares estimate and the hyperparameter estimates, e.g., [8]. For given data, the quality of a model depends on the control of its complexity, which can be described by different measures in different contexts, e.g., the model order and the equivalent degrees of freedom. A good exposition of model complexity and its selection can be found in [21]. It is worth mentioning that the degrees of freedom for LASSO have also been defined and discussed in [43, 51]. In practical applications, there are two key issues for the regularized least squares with quadratic regularization: the design of the regularization matrix and the estimation of the hyperparameters. While the latter issue has been discussed extensively in the literature, e.g., [21, 36, 46, 47], there are much fewer results on the former issue in the context of system identification, as discussed in [7]. The asymptotic properties of some widely used hyperparameter estimators, such as the maximum marginal likelihood estimator, Stein’s unbiased risk estimator, generalized cross-validation, etc., have been reported in [29, 30]. LASSO and its variants have been extremely popular in practical applications, as described in [16, 28, 32, 50]. The nuclear norm heuristic to solve matrix rank minimization problems has found wide practical application, see, e.g., [5, 6, 14, 15, 37]. Beyond the Huber loss function [23], the square loss function can be replaced also by other convex functions like the Vapnik loss function [45], as discussed later on in Chap. 6.

3.8 Appendix

3.8.1 Fundamentals of Linear Algebra

In this section, we review some fundamentals of linear algebra used in this chapter.

3.8.1.1 QR Factorization and Singular Value Decomposition

We begin by giving the definitions of the QR factorization and the SVD, which are very important decompositions used for many purposes other than solving LS problems.

For any \(\varPhi \in {\mathbb R}^{N\times n}\) with \(N\ge n\), \(\varPhi \) can be decomposed as follows:

$$\begin{aligned} \varPhi = QR, \end{aligned}$$
(3.132)

where \(Q\in {\mathbb R}^{N\times N}\) is orthogonal, i.e., \(Q^TQ=QQ^T=I_N\), and \(R\in {\mathbb R}^{N\times n}\) is upper triangular. Further assume that \(\varPhi \) has full rank. Then \(\varPhi \) can be decomposed as follows:

$$\begin{aligned} \varPhi = Q_1R_1 \end{aligned}$$
(3.133)

where \(Q_1=Q(:,1:n)\) is the matrix consisting of the first n columns of Q and \(R_1=R(1:n,1:n)\) is the matrix consisting of the first n rows and first n columns of R. The factorizations (3.132) and (3.133) are called the full and thin QR factorization, respectively. In particular, when \(R_1\) has positive diagonal entries, the thin QR factorization (3.133) is unique.
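
As a quick numerical illustration of (3.132) and (3.133), the following Python sketch computes the full and thin QR factorizations with NumPy; the matrix is randomly generated and its size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 8, 3
Phi = rng.standard_normal((N, n))       # full column rank with probability one

# full QR factorization (3.132): Q is N x N orthogonal, R is N x n upper triangular
Q, R = np.linalg.qr(Phi, mode="complete")

# thin QR factorization (3.133): Q1 has n columns, R1 is n x n upper triangular
Q1, R1 = np.linalg.qr(Phi, mode="reduced")

print(np.allclose(Q @ R, Phi))                   # True
print(np.allclose(Q1 @ R1, Phi))                 # True
print(np.allclose(Q[:, :n] @ R[:n, :n], Phi))    # the thin factors sit inside the full ones
```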

We start by providing the “economy size” definition of the SVD. For any \(\varPhi \in {\mathbb R}^{N\times n}\) with \(N\ge n\), \(\varPhi \) can be decomposed as follows:

$$\begin{aligned} \varPhi = U\varLambda V^T, \end{aligned}$$
(3.134)

where \(U\in {\mathbb R}^{N\times n}\) satisfies \(U^TU=I_n\), \(\varLambda = {{\,\mathrm{diag}\,}}(\sigma _1,\sigma _2,\dots ,\sigma _n)\) with \(\sigma _1\ge \sigma _2\ge \dots \ge \sigma _n\ge 0\), and \(V\in {\mathbb R}^{n\times n}\) is orthogonal. The factorization (3.134) is called the singular value decomposition (SVD) of \(\varPhi \) and the \(\sigma _i\), \(i=1,\dots ,n\), are called the singular values of \(\varPhi \).

The SVD also admits the “full size” formulation, as given in (3.29). One has that (3.134) still holds, but U is an orthogonal \(N\times N\) matrix and \(\varLambda \) is a rectangular \(N\times n\) diagonal matrix, while V is still an orthogonal \(n\times n\) matrix. In this second formulation, V and U can be associated with orthonormal changes of coordinates in the domain and codomain of \(\varPhi \) such that, in the new coordinates, the linear operator is diagonal.
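
Both forms of the SVD are also directly available in NumPy. The sketch below (random matrix, arbitrary size) computes the “economy size” decomposition (3.134) and the “full size” one and verifies the stated properties of the factors.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 8, 3
Phi = rng.standard_normal((N, n))

# "economy size" SVD (3.134): U is N x n with orthonormal columns, V is n x n orthogonal
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
print(np.allclose(U @ np.diag(s) @ Vt, Phi))     # True
print(np.allclose(U.T @ U, np.eye(n)))           # U^T U = I_n

# "full size" SVD: U is N x N orthogonal, Lambda is a rectangular N x n diagonal matrix
Uf, sf, Vft = np.linalg.svd(Phi, full_matrices=True)
Lam = np.zeros((N, n))
np.fill_diagonal(Lam, sf)
print(np.allclose(Uf @ Lam @ Vft, Phi))          # True
```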

3.8.1.2 Vector and Matrix Norms

Important  vector norms are the \(\ell _1\), \(\ell _2\) and \(\ell _\infty \) norms. For a given vector \(\theta \in {\mathbb R}^n\), they are denoted by \(\Vert \theta \Vert _1,\Vert \theta \Vert _2\) and \(\Vert \theta \Vert _\infty \), respectively, and are defined as follows:

$$\begin{aligned} \Vert \theta \Vert _1&= \sum _{i=1}^n |\theta _i|, \end{aligned}$$
(3.135)
$$\begin{aligned} \Vert \theta \Vert _2&=\sqrt{ \sum _{i=1}^n \theta _i^2}, \end{aligned}$$
(3.136)
$$\begin{aligned} \Vert \theta \Vert _\infty&= \max \{|\theta _1|,|\theta _2|,\ldots , |\theta _n|\}, \end{aligned}$$
(3.137)

where the \(\ell _2\) norm is also known as the Euclidean norm.

Important matrix norms are the nuclear norm, the Frobenius norm and the spectral norm.  For a given matrix \(\varPhi \in {\mathbb R}^{N\times n}\) with \(N\ge n\), these three matrix norms are denoted by \(\Vert \varPhi \Vert _*,\Vert \varPhi \Vert _\text {F}\) and \(\Vert \varPhi \Vert _2\), respectively, and are defined as follows:

$$\begin{aligned} \Vert \varPhi \Vert _*&= \sum _{i=1}^n \sigma _i(\varPhi ), \end{aligned}$$
(3.138)
$$\begin{aligned} \Vert \varPhi \Vert _\text {F}&=\sqrt{ \sum _{i=1}^N\sum _{j=1}^n \varPhi _{i,j}^2}=\sqrt{\sum _{i=1}^n \sigma _i^2(\varPhi )}, \end{aligned}$$
(3.139)
$$\begin{aligned} \Vert \varPhi \Vert _2&= \sigma _{\text {max}}(\varPhi ), \end{aligned}$$
(3.140)

where \(\sigma _i(\varPhi )\) represents the ith largest singular value of \(\varPhi \), \(\sigma _{\text {max}}(\varPhi )=\sigma _1(\varPhi )\) and \(\varPhi _{i,j}\) is the (i, j)th element of \(\varPhi \).
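
The definitions (3.135)–(3.140) can be verified numerically, together with the correspondence, discussed below, between the matrix norms and the vector norms of the singular values. The following Python sketch uses NumPy's built-in norms on a randomly generated matrix (the sizes are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((10, 4))
sigma = np.linalg.svd(Phi, compute_uv=False)     # singular values, sorted decreasingly

nuclear   = np.linalg.norm(Phi, "nuc")           # (3.138)
frobenius = np.linalg.norm(Phi, "fro")           # (3.139)
spectral  = np.linalg.norm(Phi, 2)               # (3.140)

# the three matrix norms are the l1, l2 and l_inf norms (3.135)-(3.137)
# of the vector of singular values
print(np.isclose(nuclear,   np.linalg.norm(sigma, 1)))
print(np.isclose(frobenius, np.linalg.norm(sigma, 2)))
print(np.isclose(spectral,  np.linalg.norm(sigma, np.inf)))
```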

Now, we report some properties of the vector and matrix norms. The ith largest singular value of \(\varPhi \) is equal to the square root of the ith largest eigenvalue of \(\varPhi ^T\varPhi \), or equivalently \(\varPhi \varPhi ^T\). If \(\varPhi \) is square and positive semidefinite, then the nuclear norm of \(\varPhi \) is equal to the trace of \(\varPhi \), i.e., \(\Vert \varPhi \Vert _*={{\,\mathrm{trace}\,}}(\varPhi )\). For matrices \(A,B\in {\mathbb R}^{N\times n}\), we can define the inner product on \({\mathbb R}^{N\times n}\times {\mathbb R}^{N\times n}\) as \(\langle A,B\rangle ={{\,\mathrm{trace}\,}}(A^TB)=\sum _{i=1}^N\sum _{j=1}^nA_{i,j}B_{i,j}\). So the Frobenius norm is the norm associated with this inner product. The spectral norm is defined as the induced 2-norm, i.e., for \(\varPhi \in {\mathbb R}^{N\times n}\),

$$\begin{aligned} \Vert \varPhi \Vert _2 = \max _{\theta \ne 0} \frac{\Vert \varPhi \theta \Vert _2}{\Vert \theta \Vert _2} = \max _{\Vert \theta \Vert _2=1} \Vert \varPhi \theta \Vert _2. \end{aligned}$$
(3.141)

To show that (3.141) is equal to (3.140), note that \(\max _{\Vert \theta \Vert _2=1} \Vert \varPhi \theta \Vert _2\) is equivalent to \(\max _{\Vert \theta \Vert _2^2=1} \Vert \varPhi \theta \Vert _2^2\), which is further equivalent to

$$\begin{aligned} \max _\theta \Vert \varPhi \theta \Vert _2^2 + \lambda (1-\Vert \theta \Vert _2^2)=\max _\theta \theta ^T\varPhi ^T\varPhi \theta + \lambda (1-\theta ^T\theta ), \end{aligned}$$
(3.142)

where \(\lambda \) is the Lagrange multiplier. Checking the optimality condition of (3.142) yields that the optimal solution will satisfy

$$\begin{aligned} \varPhi ^T\varPhi \theta - \lambda \theta = 0, \ \theta ^T\theta = 1. \end{aligned}$$

The above equation implies that \(\lambda \) is an eigenvalue of \(\varPhi ^T\varPhi \), and moreover,

$$\begin{aligned} \theta ^T\varPhi ^T\varPhi \theta = \lambda \theta ^T\theta =\lambda . \end{aligned}$$
(3.143)

As a result, we have

$$\begin{aligned} \max _{\Vert \theta \Vert _2=1} \Vert \varPhi \theta \Vert _2 = ( \max _{\Vert \theta \Vert _2^2=1} \theta ^T\varPhi ^T\varPhi \theta )^{\frac{1}{2}} = ( \max _{\Vert \theta \Vert _2^2=1} \lambda )^{\frac{1}{2}} = (\lambda _{\text {max}})^{\frac{1}{2}}, \end{aligned}$$

where \(\lambda _{\text {max}}\) is the largest eigenvalue of \(\varPhi ^T\varPhi \), which is equal to \(\sigma _{\text {max}}^2(\varPhi )\). Thus (3.141) is indeed equal to (3.140).
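
The eigenvalue characterization above also suggests a simple way of approximating the spectral norm without forming the full SVD: run the power iteration on \(\varPhi ^T\varPhi \) to approximate its largest eigenvalue \(\lambda _{\text {max}}\) and take the square root. The Python sketch below is a minimal implementation under the usual assumptions of the power method; the number of iterations and the random starting vector are arbitrary choices.

```python
import numpy as np

def spectral_norm_power_iteration(Phi, iters=200, seed=0):
    """Approximate ||Phi||_2 as the square root of the largest eigenvalue of Phi^T Phi."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(Phi.shape[1])
    theta /= np.linalg.norm(theta)               # keep ||theta||_2 = 1
    A = Phi.T @ Phi
    for _ in range(iters):
        theta = A @ theta
        theta /= np.linalg.norm(theta)
    lam_max = theta @ A @ theta                  # Rayleigh quotient, cf. (3.143)
    return np.sqrt(lam_max)

Phi = np.random.default_rng(1).standard_normal((30, 5))
print(spectral_norm_power_iteration(Phi))
print(np.linalg.norm(Phi, 2))                    # agrees up to numerical accuracy
```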

The aforementioned three matrix norms, the nuclear norm, the Frobenius norm and the spectral norm, can be seen as natural extensions of the three vector norms: the \(\ell _1\), \(\ell _2\) and \(\ell _\infty \) norms, respectively. In particular, if we construct an n-dimensional vector with the n singular values of \(\varPhi \) as its elements, then the three matrix norms \(\Vert \varPhi \Vert _*,\Vert \varPhi \Vert _{\text {F}}\) and \(\Vert \varPhi \Vert _2\) correspond to the \(\ell _1\), \(\ell _2\) and \(\ell _\infty \) norms of the constructed vector, respectively. Moreover, for any given norm \(\Vert \cdot \Vert \) on \({\mathbb R}^{N\times n}\), there exists a dual norm \(\Vert \cdot \Vert _\text {d}\) of \(\Vert \cdot \Vert \) defined as

$$\begin{aligned} \Vert A\Vert _\text {d}=\sup \{{{\,\mathrm{trace}\,}}(A^TB)| B\in {\mathbb R}^{N\times n}, \Vert B\Vert \le 1\}. \end{aligned}$$
(3.144)

For the vector norms, the dual norm of the \(\ell _1\) norm is the \(\ell _\infty \) norm and the dual norm of the \(\ell _2\) norm is the \(\ell _2\) norm. The properties for the vector norms extend to the matrix norms we have defined: the dual norm of the nuclear norm is the spectral norm, see, e.g., [37], and the dual norm of the Frobenius norm is itself.
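
Since the dual norm of the nuclear norm is the spectral norm (and, by finite-dimensional duality, vice versa), one has \(\Vert A\Vert _*=\sup \{{{\,\mathrm{trace}\,}}(A^TB)\,|\,\Vert B\Vert _2\le 1\}\), with the supremum attained at \(B=UV^T\), where \(A=U\varLambda V^T\) is the thin SVD of A. The following Python sketch checks this numerically on a randomly generated matrix (an illustrative choice).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# the supremum in the dual-norm characterization is attained at B = U V^T
B_star = U @ Vt
print(np.isclose(np.linalg.norm(B_star, 2), 1.0))                        # feasible
print(np.isclose(np.trace(A.T @ B_star), np.linalg.norm(A, "nuc")))      # attains ||A||_*

# for any feasible B the objective never exceeds the nuclear norm
for _ in range(1000):
    B = rng.standard_normal(A.shape)
    B /= max(np.linalg.norm(B, 2), 1.0)          # rescale so that ||B||_2 <= 1
    assert np.trace(A.T @ B) <= np.linalg.norm(A, "nuc") + 1e-9
```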

3.8.1.3 Matrix Inversion Lemma, Based on [49]

The matrix inversion lemma is also known as the Sherman–Morrison–Woodbury formula and refers to the following identity:

$$\begin{aligned} (A+UCV)^{-1} = A^{-1} - A^{-1} U(C^{-1} + V A^{-1} U) ^{-1} V A^{-1}, \end{aligned}$$
(3.145)

where A and C are invertible square matrices of dimensions \(n \times n\) and \(m \times m\), respectively, \(U\in {\mathbb R}^{n\times m}\), \(V\in {\mathbb R}^{m\times n}\), and \(C^{-1} + V A^{-1} U\) is assumed to be invertible.
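
The identity (3.145) is straightforward to check numerically. The Python sketch below builds random, generically invertible matrices of the required sizes (the dimensions are arbitrary choices) and compares the two sides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # shifted to be well conditioned
C = rng.standard_normal((m, m)) + m * np.eye(m)
U = rng.standard_normal((n, m))
V = rng.standard_normal((m, n))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ C @ V)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv
print(np.allclose(lhs, rhs))                      # True: identity (3.145) holds
```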

3.8.2 Proof of Lemma 3.1

Define \(W=-(QR+I_n)^{-1}\) and \(W_0=-(ZR+I_n)^{-1}\). Then (3.67) can be rewritten as

$$\begin{aligned}&W(QRQ+Z)W^T\ge W_0(ZRZ+Z)W_0^T. \end{aligned}$$
(3.146)

Note that

$$\begin{aligned} I_n+W = -WQR,\qquad I_n+W_0 = -W_0ZR \end{aligned}$$
(3.147)

thus (3.67) can be further rewritten as

$$\begin{aligned}&(I_n+W)R^{-1}(I_n+W)^T+WZW^T\nonumber \\&\quad \ge (I_n+W_0)R^{-1}(I_n+W_0)^T+W_0ZW_0^T. \end{aligned}$$
(3.148)

In the following, we show that

$$\begin{aligned} \nonumber&(I_n+W)R^{-1}(I_n+W)^T+WZW^T\\ \nonumber&\qquad -(I_n+W_0)R^{-1}(I_n+W_0)^T-W_0ZW_0^T\\&\qquad =(W-W_0)(R^{-1}+Z)(W-W_0)^T. \end{aligned}$$
(3.149)

Simple calculation shows that (3.149) is equivalent to

$$\begin{aligned} \nonumber&(I_n+W_0)R^{-1}W^T+WR^{-1}(I_n+W_0^T)\\ \nonumber&\qquad -(I_n+W_0)R^{-1}W_0^T -W_0R^{-1}(I_n+W_0^T)\\&\qquad =2W_0ZW_0^T-W_0ZW^T-WZW_0^T. \end{aligned}$$
(3.150)

It follows from the second equation of (3.147) that

$$\begin{aligned} (I_n+W_0)R^{-1}=-W_0Z. \end{aligned}$$
(3.151)

Now inserting (3.151) into the left-hand side of (3.150) shows that (3.150), and thus (3.149), holds. Moreover, since \((W-W_0)(R^{-1}+Z)(W-W_0)^T\) in (3.149) is positive semidefinite, Eq. (3.148) holds as well, which in turn implies that (3.67) holds. This completes the proof.

3.8.3 Derivation of Predicted Residual Error Sum of Squares (PRESS)

For the case when the kth measured output \(y_k\), \(k=1,\ldots ,N\), is not used, the corresponding ReLS-Q estimate becomes

$$\begin{aligned} \hat{\theta }_{-k}^\text {R} =\left( \sum _{i=1,i\ne k}^N \phi _i\phi _i^T + \sigma ^2 P^{-1}(\eta )\right) ^{-1}\sum _{i=1,i\ne k}^N \phi _iy_i. \end{aligned}$$
(3.152)

For the kth measured output \(y_k\), \(k=1,\ldots ,N\), the corresponding predicted output \(\hat{y}_{-k}\) and validation error \(r_{-k}\) are

$$\begin{aligned} \hat{y}_{-k}&= \phi _k^T \left( \sum _{i=1,i\ne k}^N \phi _i\phi _i^T + \sigma ^2 P^{-1}(\eta )\right) ^{-1}\sum _{i=1,i\ne k}^N \phi _iy_i, \end{aligned}$$
(3.153a)
$$\begin{aligned} r_{-k}&= y_k - \hat{y}_{-k}. \end{aligned}$$
(3.153b)

With M defined in (3.81) and by Woodbury matrix identity, e.g., [10, 17], we have

$$\begin{aligned} \nonumber \left( \sum _{i=1,i\ne k}^N \phi _i\phi _i^T + \sigma ^2 P^{-1}(\eta )\right) ^{-1}&= (M - \phi _k\phi _k^T)^{-1} \\&=M^{-1}-\frac{ M^{-1}\phi _k\phi _k^TM^{-1}}{-1+\phi _k^TM^{-1}\phi _k}. \end{aligned}$$
(3.154)

Then we have

$$\begin{aligned} \begin{aligned} r_{-k}&= y_k - \phi _k^TM^{-1}\sum _{i=1,i\ne k}^N \phi _iy_i + \phi _k^T\frac{ M^{-1}\phi _k\phi _k^TM^{-1}}{-1+\phi _k^TM^{-1}\phi _k}\sum _{i=1,i\ne k}^N \phi _iy_i\\ {}&= r_k + \phi _k^TM^{-1} \phi _ky_k+ \phi _k^T\frac{ M^{-1}\phi _k\phi _k^TM^{-1}}{-1+\phi _k^TM^{-1}\phi _k}\sum _{i=1,i\ne k}^N \phi _iy_i\\ {}&= r_k + \phi _k^TM^{-1} \phi _k\left( y_k + \frac{\phi _k^TM^{-1}}{-1+\phi _k^TM^{-1}\phi _k}\sum _{i=1,i\ne k}^N \phi _iy_i\right) \\&=r_k + \frac{\phi _k^TM^{-1} \phi _k}{-1+\phi _k^TM^{-1}\phi _k}\\ {}&\qquad \times \left( -y_k +\phi _k^TM^{-1}\phi _ky_k+\phi _k^TM^{-1}\sum _{i=1,i\ne k}^N \phi _iy_i\right) \\ {}&= r_k - \frac{\phi _k^TM^{-1} \phi _k}{-1+\phi _k^TM^{-1}\phi _k}r_k \\&= r_k\frac{1}{1-\phi _k^TM^{-1}\phi _k}, \end{aligned} \end{aligned}$$
(3.155)

which shows that \(r_{-k}\) is actually obtained by scaling \(r_k\) by the factor \(1/(1-\phi _k^TM^{-1}\phi _k)\). Accordingly, the sum of squares of the validation errors is

$$\begin{aligned} \sum _{k=1}^N r_{-k}^2 =\sum _{k=1}^N \frac{r_k^2}{(1-\phi _k^TM^{-1}\phi _k)^2}. \end{aligned}$$
(3.156)

Then the PRESS (3.80) is obtained by minimizing (3.156) with respect to \(\eta \in \varGamma \).
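
The scaling identity (3.155) is what makes PRESS cheap to evaluate: the N leave-one-out estimates (3.152) never need to be computed explicitly. The Python sketch below compares the shortcut with a brute-force leave-one-out computation on simulated data; the regularization matrix \(P(\eta )=\eta I_n\), the noise variance and all sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 50, 4
Phi = rng.standard_normal((N, n))                 # rows are the regressors phi_i^T
theta0 = rng.standard_normal(n)
sigma2 = 0.1
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.standard_normal(N)

eta = 2.0
P_inv = np.eye(n) / eta                           # illustrative choice P(eta) = eta * I_n

M = Phi.T @ Phi + sigma2 * P_inv                  # M as in (3.81)
Minv = np.linalg.inv(M)
theta_R = Minv @ Phi.T @ Y                        # ReLS-Q estimate using all data
r = Y - Phi @ theta_R                             # training residuals r_k

# leave-one-out residuals via the scaling identity (3.155)
h = np.einsum("ij,jk,ik->i", Phi, Minv, Phi)      # phi_k^T M^{-1} phi_k, k = 1,...,N
r_loo_fast = r / (1.0 - h)

# brute-force leave-one-out residuals, cf. (3.152)-(3.153)
r_loo_slow = np.empty(N)
for k in range(N):
    idx = np.arange(N) != k
    Mk = Phi[idx].T @ Phi[idx] + sigma2 * P_inv
    theta_k = np.linalg.solve(Mk, Phi[idx].T @ Y[idx])
    r_loo_slow[k] = Y[k] - Phi[k] @ theta_k

print(np.allclose(r_loo_fast, r_loo_slow))        # True
print(np.sum(r_loo_fast**2))                      # PRESS, cf. (3.156)
```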

3.8.4 Proof of Theorem 3.7

Using (3.92) and (3.100), it is easy to see that proving (3.90) is equivalent to showing that

$$\begin{aligned} \underbrace{{\mathscr {E}}\left[ \frac{1}{N} \Vert Y-\varPhi \hat{\theta }^{\text {R}}(\eta )\Vert _2^2\right] }_{\overline{\text {err}}(\eta )}\le \underbrace{{\mathscr {E}}\left[ \frac{1}{N}{\mathscr {E}}[\Vert Y_{\text {v}}-\varPhi \hat{\theta }^{\text {R}}(\eta )\Vert _2^2|\mathscr {D}_\text {T}]\right] }_{\text {EVE}_{\text {in}}(\eta )} \end{aligned}$$
(3.157)

and to prove the above inequality we need the following lemma.

Lemma 3.3

Consider the following additive measurement model:

$$\begin{aligned} x=\mu +\varepsilon ,\ x,\mu ,\varepsilon \in {\mathbb R}^p, \end{aligned}$$
(3.158)

where \(\mu \) is an unknown constant vector and \(\varepsilon \) is a random variable with zero mean and covariance matrix \({\mathscr {E}}(\varepsilon \varepsilon ^T)=\varSigma \). Let \(\hat{\mu }(x)\) be an estimator of \(\mu \) based on the data x and let \(\tilde{x}\) be new data generated from

$$\begin{aligned} \tilde{x}=\mu +\tilde{\varepsilon },\ \tilde{x}\in {\mathbb R}^p, \end{aligned}$$
(3.159)

where \(\tilde{\varepsilon }\) is a random variable uncorrelated with \(\varepsilon \), with zero mean and covariance matrix \({\mathscr {E}}(\tilde{\varepsilon }\tilde{\varepsilon }^T)=\varSigma \). Then it holds that

$$\begin{aligned} {\mathscr {E}}(\Vert \tilde{x}-\hat{\mu }(x)\Vert _2^2)&= {\mathscr {E}}(\Vert \mu -\hat{\mu }(x)\Vert _2^2) + {{\,\mathrm{trace}\,}}(\varSigma ) \end{aligned}$$
(3.160)
$$\begin{aligned}&={\mathscr {E}}(\Vert x-\hat{\mu }(x)\Vert _2^2) + 2{{\,\mathrm{trace}\,}}(Cov (\hat{\mu }(x),x)), \end{aligned}$$
(3.161)

where the expectation is over both \(\varepsilon \) and \(\tilde{\varepsilon }\).

Proof

Firstly, we consider (3.160). We have

$$\begin{aligned} {\mathscr {E}}&(\Vert \tilde{x}-\hat{\mu }(x)\Vert _2^2)={\mathscr {E}}(\Vert \tilde{x}-\mu +\mu -\hat{\mu }(x)\Vert _2^2)\\&={\mathscr {E}}(\Vert \mu -\hat{\mu }(x)\Vert _2^2)+{\mathscr {E}}(\Vert \tilde{x}-\mu \Vert _2^2)+2{\mathscr {E}}[(\tilde{x}-\mu )^T(\mu -\hat{\mu }(x))]\\ {}&={\mathscr {E}}(\Vert \mu -\hat{\mu }(x)\Vert _2^2)+{\mathscr {E}}(\Vert \tilde{\varepsilon }\Vert _2^2), \end{aligned}$$

where the cross term vanishes because \(\tilde{\varepsilon }\) has zero mean and is uncorrelated with x, and hence with \(\hat{\mu }(x)\). Since \({\mathscr {E}}(\Vert \tilde{\varepsilon }\Vert _2^2)={{\,\mathrm{trace}\,}}(\varSigma )\), this shows that (3.160) is true.

Secondly, we consider (3.161). Similarly, we have

$$\begin{aligned} {\mathscr {E}}&(\Vert \tilde{x}-\hat{\mu }(x)\Vert _2^2)={\mathscr {E}}(\Vert \tilde{x}-x+x-\hat{\mu }(x)\Vert _2^2)\\&={\mathscr {E}}(\Vert x-\hat{\mu }(x)\Vert _2^2)+{\mathscr {E}}(\Vert \tilde{x}-x\Vert _2^2)+2{\mathscr {E}}[(\tilde{x}-x)^T(x-\hat{\mu }(x))]\\ {}&={\mathscr {E}}(\Vert x-\hat{\mu }(x)\Vert _2^2)+{\mathscr {E}}(\Vert \tilde{\varepsilon }-\varepsilon \Vert _2^2) + 2{\mathscr {E}}[(\tilde{\varepsilon }-\varepsilon )^T(\varepsilon +\mu -\hat{\mu }(x))]\\ {}&={\mathscr {E}}(\Vert x-\hat{\mu }(x)\Vert _2^2)+2{{\,\mathrm{trace}\,}}(\varSigma ) - 2{\mathscr {E}}[\varepsilon ^T(\varepsilon +\mu -\hat{\mu }(x))]\\ {}&={\mathscr {E}}(\Vert x-\hat{\mu }(x)\Vert _2^2) + 2{\mathscr {E}}[\varepsilon ^T\hat{\mu }(x)], \end{aligned}$$

which implies that (3.161) is true, since \(\varepsilon \) has zero mean and thus \({\mathscr {E}}[\varepsilon ^T\hat{\mu }(x)]={{\,\mathrm{trace}\,}}(\text {Cov}(\hat{\mu }(x),x))\).    \(\square \)

Now we prove (3.157) by applying Lemma 3.3. Let

$$\begin{aligned} x=Y,\mu =\varPhi \theta _0,\hat{\mu }(x) = \varPhi \hat{\theta }^{\text {R}},\tilde{x} = Y_{\text {v}}, \varepsilon =E, \tilde{\varepsilon }=E_{\text {v}}, \varSigma =\sigma ^2 I_N, \end{aligned}$$
(3.162)

and then it follows from (3.161) that

$$\begin{aligned}&\underbrace{{\mathscr {E}}\left[ \frac{1}{N}{\mathscr {E}}[\Vert Y_{\text {v}}-\varPhi \hat{\theta }^{\text {R}}(\eta )\Vert _2^2|\mathscr {D}_\text {T}]\right] }_{\text {EVE}_{\text {in}}(\eta )}-\underbrace{{\mathscr {E}}\left[ \frac{1}{N} \Vert Y-\varPhi \hat{\theta }^{\text {R}}(\eta )\Vert _2^2\right] }_{\overline{\text {err}}(\eta )} \nonumber \\&\qquad \qquad =2\frac{1}{N}{{\,\mathrm{trace}\,}}(\text {Cov}(Y,\varPhi \hat{\theta }^{\text {R}}(\eta ))). \end{aligned}$$
(3.163)

Next we show that the right-hand side of (3.163) is nonnegative. For the ReLS-Q problem (3.58a) with the ReLS-Q estimate (3.58b), the predicted output \(\hat{Y}(\eta )\) of Y is

$$\hat{Y}(\eta ) = \varPhi \hat{\theta }^{\text {R}}(\eta )=\varPhi P\varPhi ^T(\varPhi P\varPhi ^T+\sigma ^2 I_N)^{-1}Y.$$

Then we have

$$\begin{aligned} \text {Cov}(Y,\varPhi \hat{\theta }^{\text {R}}(\eta ))&=\text {Cov}(Y,\hat{Y}(\eta ))\nonumber \\&= {\mathscr {E}}[(Y-{\mathscr {E}}(Y))(\hat{Y}(\eta )-{\mathscr {E}}(\hat{Y}(\eta )))^T]\nonumber \\ {}&= {\mathscr {E}}[(Y-{\mathscr {E}}(Y))( Y-{\mathscr {E}}( Y))^T]\varPhi P\varPhi ^T(\varPhi P\varPhi ^T+\sigma ^2 I_N)^{-1}\nonumber \\ {}&=\sigma ^2 \varPhi P\varPhi ^T(\varPhi P\varPhi ^T+\sigma ^2 I_N)^{-1}=\sigma ^2 H, \end{aligned}$$
(3.164)

where H is the hat matrix defined in (3.63). One has

$${{\,\mathrm{trace}\,}}(\text {Cov}(Y,\varPhi \hat{\theta }^{\text {R}}(\eta )))=\sigma ^2{{\,\mathrm{trace}\,}}(H)\ge 0.$$

Therefore, the right-hand side of (3.163) is nonnegative and thus (3.90) holds true, completing the proof of Theorem 3.7.
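
The inequality just proved can also be checked by simulation. The Python sketch below estimates both sides of (3.163) by Monte Carlo for a small linear model and compares their difference with \(2\sigma ^2{{\,\mathrm{trace}\,}}(H)/N\); the regularization matrix, the noise variance, the sizes and the number of Monte Carlo runs are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 40, 3
Phi = rng.standard_normal((N, n))
theta0 = rng.standard_normal(n)
sigma2 = 0.5
P = 1.5 * np.eye(n)                               # illustrative regularization matrix

S = Phi @ P @ Phi.T
H = S @ np.linalg.inv(S + sigma2 * np.eye(N))     # hat matrix, cf. (3.63)

n_mc = 20000
err_train = err_val = 0.0
for _ in range(n_mc):
    E  = np.sqrt(sigma2) * rng.standard_normal(N) # training noise
    Ev = np.sqrt(sigma2) * rng.standard_normal(N) # validation noise
    Y, Yv = Phi @ theta0 + E, Phi @ theta0 + Ev
    Yhat = H @ Y                                  # predicted output Phi * theta_R(eta)
    err_train += np.sum((Y  - Yhat) ** 2) / N
    err_val   += np.sum((Yv - Yhat) ** 2) / N

gap_mc = (err_val - err_train) / n_mc             # Monte Carlo estimate of EVE_in - err_bar
gap_th = 2.0 * sigma2 * np.trace(H) / N           # right-hand side of (3.163)
print(gap_mc, gap_th)                             # close up to Monte Carlo error
```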

3.8.5 A Variant of the Expected In-Sample Validation Error and Its Unbiased Estimator

It is possible to derive variants of the expected in-sample validation error and its unbiased estimator by modifying (3.92) and (3.100).

Assume that \(\varPhi \) has full column rank, i.e., \({{\,\mathrm{rank}\,}}(\varPhi )=n\). Then, multiplying both sides of (3.92) and (3.100) by \((\varPhi ^T\varPhi )^{-1}\varPhi ^T\) yields

$$\begin{aligned} (\varPhi ^T\varPhi )^{-1}\varPhi ^TY&=\theta _0+(\varPhi ^T\varPhi )^{-1}\varPhi ^TE, \end{aligned}$$
(3.165)
$$\begin{aligned} (\varPhi ^T\varPhi )^{-1}\varPhi ^TY_{\text {v}}&=\theta _0+(\varPhi ^T\varPhi )^{-1}\varPhi ^TE_{\text {v}}, \end{aligned}$$
(3.166)

which will be our new “true system” and new “validation data”, respectively.

Different from (3.162), we now take

$$\begin{aligned}&x=(\varPhi ^T\varPhi )^{-1}\varPhi ^TY,\mu =\theta _0,\hat{\mu }(x) = \hat{\theta }^{\text {R}}(\eta ),\tilde{x} = (\varPhi ^T\varPhi )^{-1}\varPhi ^TY_{\text {v}},\nonumber \\&\varepsilon =(\varPhi ^T\varPhi )^{-1}\varPhi ^TE, \tilde{\varepsilon }=(\varPhi ^T\varPhi )^{-1}\varPhi ^TE_{\text {v}}, \varSigma =\sigma ^2 (\varPhi ^T\varPhi )^{-1}. \end{aligned}$$
(3.167)

Note that \(\hat{\theta }^{\text {LS}}=(\varPhi ^T\varPhi )^{-1}\varPhi ^TY\). Then it follows from (3.160) and (3.161) that

$$\begin{aligned} {\mathscr {E}}(\Vert (\varPhi ^T\varPhi )^{-1}\varPhi ^TY_{\text {v}}-\hat{\theta }^{\text {R}}(\eta )\Vert _2^2)&= {\mathscr {E}}(\Vert \hat{\theta }^{\text {R}}(\eta )-\theta _0\Vert _2^2) + \sigma ^2{{\,\mathrm{trace}\,}}((\varPhi ^T\varPhi )^{-1})\\ {}&={\mathscr {E}}(\Vert \hat{\theta }^{\text {LS}}-\hat{\theta }^{\text {R}}(\eta )\Vert _2^2) + 2{{\,\mathrm{trace}\,}}(\text {Cov}(\hat{\theta }^{\text {R}}(\eta ),\hat{\theta }^{\text {LS}})). \end{aligned}$$

From the above two equations, we have

$$\begin{aligned} {\mathscr {E}}(\Vert \hat{\theta }^{\text {R}}(\eta )-\theta _0\Vert _2^2)&= {\mathscr {E}}(\Vert \hat{\theta }^{\text {LS}}-\hat{\theta }^{\text {R}}(\eta )\Vert _2^2)\\ {}&\qquad + 2{{\,\mathrm{trace}\,}}(\text {Cov}(\hat{\theta }^{\text {R}}(\eta ),\hat{\theta }^{\text {LS}}))-\sigma ^2{{\,\mathrm{trace}\,}}((\varPhi ^T\varPhi )^{-1}). \end{aligned}$$

Further note that

$$\begin{aligned}&\hat{\theta }^{\text {R}}(\eta )=(\varPhi ^T\varPhi +\sigma ^2P^{-1}(\eta ))^{-1}\varPhi ^TY=(\varPhi ^T\varPhi +\sigma ^2P^{-1}(\eta ))^{-1}\varPhi ^T\varPhi \hat{\theta }^{\text {LS}}, \\&\text {Cov}(\hat{\theta }^{\text {LS}}, \hat{\theta }^{\text {LS}})=\sigma ^2(\varPhi ^T\varPhi )^{-1}, \end{aligned}$$

and thus we have

$$\begin{aligned} \nonumber {\mathscr {E}}(\Vert \hat{\theta }^{\text {R}}(\eta )-\theta _0\Vert _2^2)=\,&{\mathscr {E}}(\Vert \hat{\theta }^{\text {LS}}-\hat{\theta }^{\text {R}}(\eta )\Vert _2^2) \\&+ 2\sigma ^2{{\,\mathrm{trace}\,}}((\varPhi ^T\varPhi +\sigma ^2P^{-1}(\eta ))^{-1}-0.5(\varPhi ^T\varPhi )^{-1}). \end{aligned}$$
(3.168)

Note that \({\mathscr {E}}(\Vert \hat{\theta }^{\text {R}}(\eta )-\theta _0\Vert _2^2)\) is equal to \({{\,\mathrm{trace}\,}}(\text {MSE}({\hat{\theta }^{\text {R}}(\eta )},\theta _0))\); we thus denote it by \(\text {mse}_{\eta }\), and from (3.168) we readily obtain an unbiased estimator of \(\text {mse}_{\eta }\) as follows:

$$\begin{aligned} \widehat{\text {mse}_{\eta }}&= \Vert \hat{\theta }^{\text {LS}}-\hat{\theta }^{\text {R}}(\eta )\Vert _2^2 + 2\sigma ^2{{\,\mathrm{trace}\,}}((\varPhi ^T\varPhi +\sigma ^2P^{-1}(\eta ))^{-1}-0.5(\varPhi ^T\varPhi )^{-1}). \end{aligned}$$
(3.169)

Now given the training data (3.84), the corresponding estimate \(\widehat{\text {mse}_{\eta }} \) of \(\text {mse}_{\eta }\) can be used to estimate the hyperparameter \(\eta \): we should take the value of \(\eta \in \varGamma \) that minimizes (3.169), i.e.,

$$\begin{aligned} \hat{\eta } = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\eta \in \varGamma } \Vert \hat{\theta }^{\text {LS}}-\hat{\theta }^{\text {R}}(\eta )\Vert _2^2 + 2\sigma ^2{{\,\mathrm{trace}\,}}((\varPhi ^T\varPhi +\sigma ^2P^{-1}(\eta ))^{-1}-0.5(\varPhi ^T\varPhi )^{-1}). \end{aligned}$$
(3.170)

The criterion (3.170) is known as the SURE of the expected in-sample validation error for the true system (3.165) and the validation data (3.166), see, e.g., [33, 40].
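
As a minimal illustration of how (3.170) can be used in practice, the Python sketch below evaluates the criterion on a grid of hyperparameter values for the illustrative choice \(P(\eta )=\eta I_n\); the data generation, the grid and all sizes are arbitrary assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 5
Phi = rng.standard_normal((N, n))
theta0 = rng.standard_normal(n)
sigma2 = 1.0
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.standard_normal(N)

PtP = Phi.T @ Phi
PtP_inv = np.linalg.inv(PtP)
theta_ls = PtP_inv @ Phi.T @ Y                    # least squares estimate

def sure(eta):
    """Criterion (3.170) for the illustrative choice P(eta) = eta * I_n."""
    A_inv = np.linalg.inv(PtP + (sigma2 / eta) * np.eye(n))
    theta_r = A_inv @ Phi.T @ Y                   # ReLS-Q estimate for this eta
    return (np.sum((theta_ls - theta_r) ** 2)
            + 2.0 * sigma2 * np.trace(A_inv - 0.5 * PtP_inv))

grid = np.logspace(-3, 3, 200)                    # candidate hyperparameter values
eta_hat = grid[np.argmin([sure(eta) for eta in grid])]
print(eta_hat)
```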