## Abstract

Adopting a quadratic loss, the performance of an estimator can be measured in terms of its mean squared error, which decomposes into a variance and a bias component. This introductory chapter contains two linear regression examples which illustrate the importance of designing estimators that balance these two components well. The first example will deal with the estimation of the means of independent Gaussians. We will review the classical least squares approach which, at first sight, could appear the most appropriate solution to the problem. Remarkably, we will instead see that this unbiased approach can be dominated by a particular biased estimator, the so-called James–Stein estimator. Within this book, this represents the first example of regularized least squares, an estimator which will play a key role in subsequent chapters. The second example will deal with a classical system identification problem: impulse response estimation. A simple numerical experiment will show how the variance of least squares can be too large, leading to unacceptable system reconstructions. The use of an approach known as ridge regression will give a first, simple intuition about the usefulness of regularization in the system identification scenario.


## 1.1 The Stein Effect

Consider the following “basic” statistical problem. Starting from the realizations of *N* independent Gaussian random variables \(y_i \sim \mathscr {N}(\theta _i,\sigma ^2)\), our aim is to reconstruct the means \(\theta _i\), contained in the vector \(\theta \) seen as a deterministic but unknown parameter vector.^{Footnote 1} The estimation performance will be measured in terms of mean squared error (MSE). In particular, let \(\mathscr {E}\) and \(\Vert \cdot \Vert \) denote expectation and Euclidean norm, respectively. Then, given an estimator \(\hat{\theta }\) of an *N*-dimensional vector \(\theta \) with *i*th component \(\theta _i\), one has

$$ \mathrm{MSE}(\hat{\theta }) \ = \ \mathscr {E} \Vert \hat{\theta } - \theta \Vert ^2 \ = \ \mathscr {E} \Vert \hat{\theta } - \mathscr {E}\hat{\theta } \Vert ^2 \ + \ \Vert \mathscr {E}\hat{\theta } - \theta \Vert ^2, \tag{1.1} $$

where in the last step we have decomposed the error into two components. The first one is the *variance* of the estimator, while the second one, the squared norm of the difference between the mean of the estimator and the true parameter vector, measures the (squared) *bias*. If the mean coincides with \(\theta \), the estimator is said to be *unbiased*. The total error thus has two contributions: the variance and the (squared) bias.
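
The decomposition (1.1) is easy to verify numerically. The following Python sketch (illustrative, not part of the original text) estimates the two components by Monte Carlo for an arbitrarily chosen shrinkage estimator \(\hat{\theta } = 0.8\,Y\), where \(Y\) collects the observations \(y_i\).

```python
# Minimal sketch: Monte Carlo check of MSE = variance + squared bias, Eq. (1.1).
# The 0.8 shrinkage factor is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
N, sigma, n_runs = 10, 1.0, 200_000
theta = np.linspace(-1.0, 1.0, N)                  # arbitrary true means

Y = theta + sigma * rng.standard_normal((n_runs, N))
theta_hat = 0.8 * Y                                # a (biased) shrinkage estimator

mse = np.mean(np.sum((theta_hat - theta) ** 2, axis=1))
mean_hat = theta_hat.mean(axis=0)                  # Monte Carlo estimate of E[theta_hat]
variance = np.mean(np.sum((theta_hat - mean_hat) ** 2, axis=1))
bias_sq = np.sum((mean_hat - theta) ** 2)

print(mse, variance + bias_sq)                     # the two values nearly coincide
```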

Note that the mean estimation problem introduced above is a simple instance of linear Gaussian regression. In fact, letting \(I_N\) be the \(N \times N\) identity matrix, the measurement model is

$$ Y = \theta + E, \qquad E \sim \mathscr {N}(0,\sigma ^2 I_N), $$

where *Y* is the *N*-dimensional (column) vector with *i*th component \(y_i\). The most popular strategy to recover \(\theta \) from data is least squares, which also corresponds to maximum likelihood in this Gaussian scenario. The solution minimizes

$$ \Vert Y - \theta \Vert ^2 $$

and is then given by

$$ \hat{\theta }^{LS} = Y. $$

At first sight, the obtained estimator appears the most reasonable one. A first intuitive argument supporting it is the fact that the random variables \(\{y_j\}_{ j \ne i }\) seem unable to carry any information on \(\theta _i\), since all the noises \(e_i\) are independent. Hence, the natural estimate of \(\theta _i\) indeed appears to be its noisy observation \(y_i\). This estimator is also unbiased: for any \(\theta \) we have

$$ \mathscr {E} \hat{\theta }^{LS} = \mathscr {E} Y = \theta . $$

Hence, from (1.1) we see that the MSE coincides with its variance, which is constant over \(\theta \) and given by

$$ MSE_{LS} = \mathscr {E} \Vert \hat{\theta }^{LS} - \theta \Vert ^2 = N \sigma ^2. $$

According to Markov’s theorem \(\hat{\theta }^{LS}\) is also efficient. This means that its variance is equal to the Cramér–Rao limit: no unbiased estimate can be better than the least squares estimate, e.g., see [9, 17].

### 1.1.1 The James–Stein Estimator

By introducing some bias in the inference process, it is easy to obtain estimators which strictly dominate least squares (in the MSE sense) over certain parameter regions. The most trivial example is the constant estimator \(\hat{\theta } = a\). Its variance is zero, so that its MSE reduces to the bias component \(\Vert \theta -a\Vert ^2\). Hence, even if the behaviour of \(\hat{\theta }\) is unacceptable in most of the parameter space, this estimator outperforms least squares in the region

$$ \{ \theta \ : \ \Vert \theta - a \Vert ^2 < N \sigma ^2 \}. $$

Note a feature common to least squares and the constant estimator: neither attempts to trade off bias and variance, each simply sets one of the two MSE components in (1.1) to zero. An alternative route is the design of estimators which try to balance bias and variance. Rather surprisingly, we will now see that this strategy can dominate \(\hat{\theta }^{LS}\) over the entire parameter space.

The first criticisms of least squares were raised by Stein in the ’50s [23] and can be summarized as follows. A good mean estimator \(\hat{\theta }\) should also lead to a good estimate of the Euclidean norm of \(\theta \). Thus, one should have

$$ \mathscr {E} \Vert \hat{\theta } \Vert ^2 \approx \Vert \theta \Vert ^2. $$

But, if we consider the “natural” estimator \(\hat{\theta }^{LS}=Y\), in view of the independence of the errors \( e_i\), one obtains

$$ \mathscr {E} \Vert \hat{\theta }^{LS} \Vert ^2 = \mathscr {E} \Vert Y \Vert ^2 = \Vert \theta \Vert ^2 + N \sigma ^2. $$

This shows that the least squares estimator tends to overestimate \(\Vert \theta \Vert \). It thus seems desirable to correct \(\hat{\theta }^{LS}\) by shrinking the estimate towards the origin, e.g., adopting estimators of the form \(\hat{\theta }^{LS}(1-r)\), where *r* is a positive scalar. The most famous example is the James–Stein estimator [15] where *r* is determined from data as follows:

$$ r = \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2}, $$

hence leading to

$$ \hat{\theta }^{JS} = \left( 1 - \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} \right) Y. $$

Note that, even if all the components of *Y* are mutually independent, \(\hat{\theta }^{JS}\) exploits all of them to estimate each \(\theta _i\). The surprising outcome is that \(\hat{\theta }^{JS}\) outperforms \(\hat{\theta }^{LS}\) over the whole parameter space, as illustrated in the next theorem.
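
Before stating the formal result, a quick Monte Carlo comparison makes the claim concrete. The following sketch (illustrative; the true mean vector is an arbitrary choice) applies the shrinkage factor above to simulated data and compares the empirical MSEs of the two estimators.

```python
# Minimal sketch: Monte Carlo comparison of least squares and James-Stein.
import numpy as np

rng = np.random.default_rng(0)
N, sigma, n_runs = 10, 1.0, 100_000
theta = np.array([3.0] + [0.0] * (N - 1))       # arbitrary true mean vector

Y = theta + sigma * rng.standard_normal((n_runs, N))

theta_ls = Y                                    # least squares / ML estimate
shrink = 1.0 - (N - 2) * sigma**2 / np.sum(Y**2, axis=1, keepdims=True)
theta_js = shrink * Y                           # James-Stein estimate

mse_ls = np.mean(np.sum((theta_ls - theta) ** 2, axis=1))   # close to N*sigma^2
mse_js = np.mean(np.sum((theta_js - theta) ** 2, axis=1))   # uniformly smaller
print(mse_ls, mse_js)
```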

### Theorem 1.1

(James–Stein’s MSE, based on [15]) Consider *N* Gaussian and independent random variables \(y_i \sim \mathscr {N}(\theta _i,\sigma ^2)\). Let also \(\hat{\theta }^{JS}\) denote the James–Stein estimator of the means, i.e.,

$$ \hat{\theta }^{JS} = \left( 1 - \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} \right) Y. $$

Then, if \(N \ge 3\), the MSE of \(\hat{\theta }^{JS}\) satisfies

$$ MSE_{JS} = \mathscr {E} \Vert \hat{\theta }^{JS} - \theta \Vert ^2 < N \sigma ^2 = MSE_{LS} \quad \forall \, \theta \in {\mathbb R}^N. $$

We say that an estimator dominates another estimator if for all the \(\theta \) its MSE is not larger and for some \(\theta \) it is smaller. In statistics an estimator is then said to be *admissible* if no other estimator exists that dominates it in terms of MSE. The above theorem then shows that the least squares estimator of the mean of a multivariate Gaussian is not admissible if the dimension exceeds two. The reason is that, even when the Gaussians are independent, the global MSE can be reduced uniformly by adding some bias to the estimate. This is also graphically illustrated in Fig. 1.1 where \(MSE_{JS}\), along with its decomposition, is plotted as a function of the component \(\theta _1\) of the ten-dimensional vector \(\theta =[\theta _1 \ 0 \ldots 0]\) (noise variance is equal to one). One can see that \(MSE_{JS}<MSE_{LS}\) since the bias introduced by \(\hat{\theta }^{JS}\) is compensated by a greater reduction in the variance of the estimate. Note however that James–Stein improves the overall MSE and not the individual errors affecting the \(\theta _i\). This aspect can be important in certain applications where it is not desirable to trade a higher individual MSE for a smaller overall MSE.

It is easy to check that the James–Stein estimator admits the following interesting reformulation:

$$ \hat{\theta }^{JS} = \arg \min _{\theta } \ \Vert Y - \theta \Vert ^2 + \gamma \Vert \theta \Vert ^2, \tag{1.3} $$

where the positive scalar \(\gamma \) is determined from data as follows:

$$ \gamma = \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2 - (N-2)\sigma ^2}. \tag{1.4} $$

Equation (1.3) thus reveals that \(\hat{\theta }^{JS}\) is a particular version of *regularized least squares*, an estimator which will play a central role in this book. In particular, the objective in (1.3) contains two contrasting terms. The first one, \(\Vert Y- \theta \Vert ^2\), is a quadratic loss which measures the adherence to experimental data. The second one, \(\Vert \theta \Vert ^2\), is a regularizer which shrinks the estimate towards the origin by penalizing the energy of the solution. The role of the regularization parameter \(\gamma \) is then to balance these two components via a simple scalar adjustment. Equation (1.4) shows that James–Stein’s strategy is to set its value to the inverse of an estimate of the signal-to-noise ratio.
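
Since the objective in (1.3) is quadratic, its minimizer has the closed form \(Y/(1+\gamma )\); choosing \(\gamma = (N-2)\sigma ^2/(\Vert Y\Vert ^2-(N-2)\sigma ^2)\), i.e., the value consistent with the James–Stein shrinkage factor, gives back exactly \(\hat{\theta }^{JS}\). The short Python sketch below (illustrative, and assuming \(\Vert Y\Vert ^2 > (N-2)\sigma ^2\) so that \(\gamma > 0\)) verifies the equivalence numerically.

```python
# Minimal sketch: the minimizer of ||Y - theta||^2 + gamma*||theta||^2 is
# Y / (1 + gamma); with gamma set as above it coincides with James-Stein.
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 10, 1.0
Y = rng.standard_normal(N) + 2.0                           # toy observation vector

gamma = (N - 2) * sigma**2 / (np.sum(Y**2) - (N - 2) * sigma**2)
theta_reg = Y / (1.0 + gamma)                              # regularized least squares
theta_js = (1.0 - (N - 2) * sigma**2 / np.sum(Y**2)) * Y   # James-Stein estimate

print(np.allclose(theta_reg, theta_js))                    # True
```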

### 1.1.2 Extensions of the James–Stein Estimator \(\star \)

We have seen that the James–Stein estimator corrects each component of \(\hat{\theta }^{LS}\) by shifting it towards the origin. This implies that the MSE improvement will be larger when the components of \(\theta \) are close to zero. Actually, there is nothing special about the origin. If the true \(\theta \) is expected to be close to \(a \in {\mathbb R}^N\), one can modify the original \(\hat{\theta }^{JS}\) as follows:

$$ \hat{\theta } = a + \left( 1 - \frac{(N-2)\sigma ^2}{\Vert Y - a \Vert ^2} \right) (Y - a). $$

The result is an estimator which still dominates least squares, with the origin’s role now played by *a*. The estimator thus concentrates the MSE improvement around *a*.
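
A minimal sketch of this shifted variant is given below (the function name and the numerical values are ours, purely for illustration): the same \((N-2)\sigma ^2/\Vert Y-a\Vert ^2\) shrinkage is applied to the deviation of \(Y\) from \(a\).

```python
# Minimal sketch: James-Stein shrinkage towards an arbitrary point a.
import numpy as np

def james_stein_towards(Y, a, sigma2):
    """Shrink the deviation of Y from a; requires len(Y) >= 3."""
    N = len(Y)
    shrink = 1.0 - (N - 2) * sigma2 / np.sum((Y - a) ** 2)
    return a + shrink * (Y - a)

# Example: shrink ten noisy observations towards their presumed centre a = 5.
Y = np.array([4.8, 5.2, 5.1, 4.9, 5.3, 4.7, 5.0, 5.2, 4.9, 5.1])
print(james_stein_towards(Y, a=5.0 * np.ones(10), sigma2=0.04))
```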

Now, let us consider a non-orthonormal scenario where Gaussian linear regression amounts to estimating \(\theta \) from the *N* measurements

$$ y_i = d_i \theta _i + e_i, \qquad d_i \ne 0, \quad i=1,\ldots ,N, $$

with all the noises \(e_i\) mutually independent. The least squares (maximum likelihood) estimator is now

$$ \hat{\theta }^{LS}_i = \frac{y_i}{d_i}, \quad i=1,\ldots ,N, $$

and its MSE is the sum of the variances of \(\hat{\theta }^{LS}_i\), i.e.,

$$ MSE_{LS} = \sigma ^2 \sum _{i=1}^N \frac{1}{d_i^2}. $$

Note that the MSE can be large when just one of the \(d_i\) is small. In this case, the problem is said to be *ill-conditioned*: even a moderate measurement error can lead to a large reconstruction error.
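
A two-line numerical check (a sketch, assuming the diagonal measurement model above) shows how a single small \(d_i\) dominates the least squares error.

```python
# Minimal sketch: MSE_LS = sigma^2 * sum(1/d_i^2) explodes when one d_i is small.
import numpy as np

sigma = 0.1
d = np.array([1.0, 1.0, 1.0, 0.01])       # one poorly observed direction
print(sigma**2 * np.sum(1.0 / d**2))      # ~100.03, ruled by the smallest d_i
```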

Also in this non-orthonormal scenario, it is possible to design estimators whose MSE is uniformly smaller than \(MSE_{LS}\). The number of possible choices is huge, depending on the region of the parameter space where one wants to concentrate the improvement. There is, however, an important limitation shared by all Stein-type estimators: in general, they are not very effective against ill-conditioning. This is illustrated in the following example, which describes an estimator whose negative features are representative of some drawbacks of Stein-type estimation in non-orthonormal settings.

### Example 1.2

(*A generalization of James–Stein*) Consider the estimator \(\hat{\theta }\) whose *i*th component is given by

where

It is now shown that \(\hat{\theta }\) is a generalization of James–Stein able to outperform least squares over the entire parameter space. In fact, defining

after simple computations we obtain

where the last equality comes from Lemma 1.1 reported in Sect. 1.4. Since

one has

which implies

However, assume that the problem is ill-conditioned. Then, if one \(d_i\) is small and the values of the \(d_i\) are quite spread, we could well have \(d_i^2 /S \approx 0\). Hence, (1.5) essentially reduces to

$$ \hat{\theta }_i \approx \frac{y_i}{d_i}, $$

which is the least squares estimate of \(\theta _i\). This means that the signal components most influenced by the noise, i.e., those associated with small \(d_i\), are not regularized. Thus, in the presence of ill-conditioning, \(\hat{\theta }\) will likely return an estimate affected by large errors. \(\square \)

## 1.2 Ridge Regression

Consider now one of the fundamental problems in system identification. The task is to estimate the impulse response \(g^0\) of a discrete-time, linear and causal dynamic system, starting from noisy output data. The measurement model is

$$ y(t) = \sum _{k=1}^{\infty } g_k^0 \, u(t-k) + e(t), \qquad t=1,\ldots ,N, \tag{1.6} $$

where *t* denotes time, the sampling interval is one time unit for simplicity, the \(g_k^0\) are the impulse response coefficients, *u*(*t*) is the known system input and *e*(*t*) is the noise.

To determine the impulse response from input–output measurements, one of the main questions is how to parametrize the unknown \(g^0\). The classical approach, which will also be reviewed in the next chapter, introduces a collection of impulse response models \(g(\theta )\), each parametrized by a different vector \(\theta \). In particular, here we will adopt an FIR model of order *m*, i.e., \(g_k(\theta )=\theta _k\) for \(k=1, \ldots ,m\) and zero elsewhere. This permits reformulating (1.6) as a linear regression: we stack all the elements *y*(*t*) and *e*(*t*) to form the vectors *Y* and *E* and obtain the model

$$ Y = \varPhi \theta + E, $$

with the regression matrix \(\varPhi \in {\mathbb R}^{N \times m}\) given by

$$
\varPhi = \begin{bmatrix} u(0) & u(-1) & \ldots & u(1-m) \\ u(1) & u(0) & \ldots & u(2-m) \\ \vdots & \vdots & & \vdots \\ u(N-1) & u(N-2) & \ldots & u(N-m) \end{bmatrix},
$$

with \(u(t)=0\) for \(t<0\).

We can now use least squares to estimate \(\theta \). Assuming \(\varPhi ^T\varPhi \) to be full rank, we obtain

$$ \hat{\theta }^{LS} = (\varPhi ^T \varPhi )^{-1} \varPhi ^T Y. \tag{1.7} $$
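
As a concrete illustration of this linear regression formulation, the following Python sketch builds \(\varPhi \) and computes the estimate (1.7) on toy data; the input, the "true" impulse response and the noise level are illustrative choices, not those of the experiment described below.

```python
# Minimal sketch: FIR regression matrix and least squares estimate (toy data).
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
N, m = 1000, 50
u = rng.standard_normal(N)                  # input samples u(0), ..., u(N-1)
g0 = 0.8 ** np.arange(1, m + 1)             # toy "true" impulse response

# Row t of Phi is [u(t-1), ..., u(t-m)], with u(j) = 0 for j < 0 (system at rest).
Phi = toeplitz(u, np.zeros(m))
y = Phi @ g0 + 0.1 * rng.standard_normal(N)   # noisy outputs y(1), ..., y(N)

theta_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # least squares FIR estimate
```
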
Note that the impulse response estimate is a function of the FIR order, which corresponds to the dimension *m* of \(\theta \). The choice of *m* is a trade-off between bias (a large *m* is needed to describe slowly decaying impulse responses without too much error) and variance (a large *m* requires the estimation of many parameters, leading to a large variance). This can be illustrated with a numerical experiment. The unknown impulse response \(g^0\) is defined by the following rational transfer function:

which, in practice, is equal to zero after less than 50 samples (\(g^0\) is the red line in Fig. 1.3). We estimate the system from 1000 outputs corrupted by white Gaussian noise *e*(*t*) with variance equal to the variance of the noiseless output divided by 50, see Fig. 1.2 (bottom panel). Data come from the system initially at rest and then fed at \(t=0\) with white noise low-pass filtered by \(z/(z-0.99)\), see Fig. 1.2 (top panel). The reconstruction error is very large if we try to estimate \(g^0\) with \(m=50\): linear models are easy to estimate, but the drawback is that high-order FIR models may suffer from high variance. Hence, it is important to select a model order which balances bias and variance well. To do that, one needs to try different values of *m* and then use some validation procedure to determine the “optimal” one. In this case, since the true \(g^0\) is known, we can obtain the best value by selecting the \(m \in \{1,\ldots ,50\}\) which minimizes the MSE. This is an example of an *oracle-based* procedure not implementable in practice: the optimal order is selected by exploiting the knowledge of the true system. We obtain \(m=18\), which corresponds to \(MSE_{LS}=70.7\) and leads to the impulse response estimate displayed in Fig. 1.3. Even if the data set size is large and the signal-to-noise ratio is good, the estimate is far from satisfactory. The reason is that the low-pass input has poor excitation and leads to an ill-conditioned problem. This means that the condition number of the regression matrix \(\varPhi \) is large, so that even a small output error can produce a large reconstruction error.
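
The link between poor excitation and ill-conditioning is easy to visualize numerically. The following sketch (illustrative; it does not require the true system of the experiment) compares the condition number of \(\varPhi ^T\varPhi \) obtained with a white-noise input and with the same noise low-pass filtered by \(z/(z-0.99)\).

```python
# Minimal sketch: poor input excitation inflates the conditioning of Phi^T Phi.
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter

rng = np.random.default_rng(0)
N, m = 1000, 50
w = rng.standard_normal(N)

u_white = w                                     # white-noise input
u_lowpass = lfilter([1.0], [1.0, -0.99], w)     # white noise filtered by z/(z-0.99)

for u in (u_white, u_lowpass):
    Phi = toeplitz(u, np.zeros(m))              # FIR regression matrix (u(t)=0, t<0)
    print(np.linalg.cond(Phi.T @ Phi))          # orders of magnitude larger below
```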

An alternative to the classical paradigm, where different model structures are introduced, is the following straightforward generalization of (1.3), known as *ridge regression* [13, 14]:

$$ \hat{\theta }^{R} = \arg \min _{\theta } \ \Vert Y - \varPhi \theta \Vert ^2 + \gamma \Vert \theta \Vert ^2 = (\varPhi ^T \varPhi + \gamma I_{m})^{-1} \varPhi ^T Y, $$

where we set \(m=50\) to solve our problem. Letting \(A=(\varPhi ^T\varPhi +\gamma I_{m})^{-1}\varPhi ^T\), it is easy to derive the MSE decomposition associated with \(\hat{\theta }^{R}\):

$$ MSE_{R} = \mathscr {E} \Vert \hat{\theta }^{R} - \theta \Vert ^2 = \sigma ^2 \, \mathrm {trace}(A A^T) + \Vert (A \varPhi - I_m) \theta \Vert ^2. $$

Figure 1.4 displays \(MSE_{R}\) for the particular system identification problem at hand as a function of the regularization parameter. Note that \(\gamma \) plays the role of the model order in the classical scenario but can be tuned in a continuous manner to reach a good bias-variance trade-off. It is also interesting to see its influence on the variance and bias components. The variance is a decreasing function of the regularization parameter. Hence, its maximum is reached for \(\gamma =0\), where \(\hat{\theta }^{R}\) reduces to the least squares estimator \(\hat{\theta }^{LS}\) given by (1.7) with \(m=50\). Instead, the bias increases with \(\gamma \). In the limit \(\gamma \rightarrow \infty \), the penalty \(\Vert \theta \Vert ^2\) is so overweighted that \(\hat{\theta }^{R}\) becomes the constant estimator centred on the origin (it returns an all-zero impulse response).
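
The two components of this decomposition can be evaluated in closed form for any value of \(\gamma \), which is how curves like those of Fig. 1.4 are obtained. The Python sketch below does this on toy data: the input filter \(z/(z-0.99)\) is the one mentioned above, while the "true" impulse response and the noise level are illustrative assumptions, so the numbers will not match the figure.

```python
# Minimal sketch: variance and squared bias of ridge regression over a grid of gamma.
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter

rng = np.random.default_rng(0)
N, m, sigma = 1000, 50, 0.1
u = lfilter([1.0], [1.0, -0.99], rng.standard_normal(N))   # poorly exciting input
Phi = toeplitz(u, np.zeros(m))
theta_true = 0.8 ** np.arange(1, m + 1)                    # toy impulse response

for gamma in (1e-6, 1e-2, 1.0, 100.0):
    A = np.linalg.solve(Phi.T @ Phi + gamma * np.eye(m), Phi.T)
    variance = sigma**2 * np.trace(A @ A.T)
    bias_sq = np.sum((A @ Phi @ theta_true - theta_true) ** 2)
    print(gamma, variance, bias_sq, variance + bias_sq)    # MSE_R = variance + bias^2
```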

In Fig. 1.5, we finally display the ridge regularized estimate with \(\gamma \) set to the value minimizing the error and leading to \(MSE_{R}=16.8\). It is evident that ridge regression provides a much better bias-variance trade-off than selecting the FIR order.

## 1.3 Further Topics and Advanced Reading

Stein’s intuition on the development of an estimator able to dominate least squares in terms of global MSE can be found in [23], while the specific form of \(\hat{\theta }^{JS}\) was obtained in [15]. Since then, a large variety of estimators outperforming least squares, also under different losses, has been designed. It has been proved that there exist estimators which dominate James–Stein, even if the MSE improvement is not large, as described in [12, 16, 25]. Extensions and applications can be found in [5, 6, 11, 22, 24, 26]. A James–Stein version of the Kalman filter is derived in [18]. For interesting discussions on the limitations of Stein-type estimators in facing ill-conditioning see [8], but also [19] for new outcomes with better numerical stability properties. Other developments are reported in [7], where generalizations of Stein’s lemma are also described.

The paper [10] describes connections between James–Stein estimation and the so-called empirical Bayes approaches which will be treated later on in this book. The interplay between Stein-type estimators and the Bayes approach is also discussed in [2]. Here, one can also find an estimator which dominates least squares concentrating the MSE improvement in an ellipsoid that can be chosen by the user in the parameter space. This approach is deeply connected with robust Bayesian estimation concepts, e.g., see [1, 3].

The term ridge regression has been popularized by the works [13, 14]. This approach, introduced to guard against ill-conditioning and numerical instability, is an example of Tikhonov regularization for ill-posed problems. Among the first classical works on regularization and inverse problems, it is worth already citing [4, 20, 27–29]. A recent survey on the use of regularization for system identification can instead be found in [21]. The literature on this topic is huge and other relevant works will be cited in the next chapters.

## 1.4 Appendix: Proof of Theorem 1.1

To discuss the properties of the James–Stein estimator, it is first useful to introduce a result, known as Stein’s lemma, which is a simplified version of Lemma 3.2 reported in Chap. 3.

### Lemma 1.1

(Stein’s lemma, based on [24]) Consider *N* Gaussian and independent random variables \(y_i \sim \mathscr {N}(\theta _i,\sigma ^2)\). Let also \(h:{\mathbb R}^N \rightarrow {\mathbb R}\) be a differentiable function such that \(\mathscr {E} \left| \frac{\partial h (Y) }{\partial y_i} \right| < \infty \) for \(i=1,\ldots ,N\). Then, it holds that

$$ \mathscr {E} \left[ (y_i - \theta _i) \, h(Y) \right] = \sigma ^2 \, \mathscr {E} \left[ \frac{\partial h (Y) }{\partial y_i} \right] , \quad i=1,\ldots ,N. $$

### Proof

During the proof, we use \(\mathscr {E}_{j \ne i}\) to denote the expectation conditional on \(\{y_j\}_{ j \ne i}\). Also, abusing notation, *h*(*x*) with \(x \in {\mathbb R}\) indicates the function *h* with *i*th argument set to *x*, while the other arguments are set to \(y_j, \ j \ne i\).

Note that, in view of the independence assumptions, each \(y_i\) conditional on \(\{y_j\}_{ j \ne i}\) is still Gaussian with mean \(\theta _i\) and variance \(\sigma ^2\). Then, using integration by parts, one has

Note that the penultimate equality exploits the fact that \(h(x) \exp (-(x-\theta _i)^2/(2\sigma ^2))\) must vanish as \(|x| \rightarrow \infty \), otherwise the assumption \(\mathscr {E} \left| \frac{\partial h (Y) }{\partial y_i} \right| < \infty \) would not hold. Using the above result, we obtain

and this completes the proof. \(\square \)
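
Stein’s lemma can also be checked by simulation. The sketch below (with an illustrative choice of \(h\)) compares the two sides of the identity by Monte Carlo for \(h(Y)=y_1/\Vert Y\Vert ^2\), the type of function used in the proof of Theorem 1.1 reported next.

```python
# Minimal sketch: Monte Carlo check of Stein's lemma for h(Y) = y_1 / ||Y||^2.
import numpy as np

rng = np.random.default_rng(0)
N, sigma, n_runs = 5, 1.0, 1_000_000
theta = np.linspace(0.5, 2.5, N)                      # arbitrary true means

Y = theta + sigma * rng.standard_normal((n_runs, N))
norm2 = np.sum(Y**2, axis=1)

h = Y[:, 0] / norm2                                   # h(Y) = y_1 / ||Y||^2
dh = (norm2 - 2.0 * Y[:, 0] ** 2) / norm2**2          # analytic derivative dh/dy_1

lhs = np.mean((Y[:, 0] - theta[0]) * h)               # E[(y_1 - theta_1) h(Y)]
rhs = sigma**2 * np.mean(dh)                          # sigma^2 * E[dh/dy_1]
print(lhs, rhs)                                       # approximately equal
```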

We now show that the MSE of the James–Stein estimator is uniformly smaller than the MSE of least squares. One has

$$
\begin{aligned}
MSE_{JS} &= \mathscr {E} \left\Vert Y - \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} Y - \theta \right\Vert ^2 \\
&= \mathscr {E} \left[ \Vert Y - \theta \Vert ^2 + \frac{(N-2)^2\sigma ^4}{\Vert Y \Vert ^2} - 2 (N-2)\sigma ^2 \, \frac{(Y-\theta )^T Y}{\Vert Y \Vert ^2} \right] .
\end{aligned}
$$

As for the last term inside the expectation, exploiting Stein’s lemma with

$$ h(Y) = \frac{y_i}{\Vert Y \Vert ^2}, \quad i=1,\ldots ,N, $$

one has

$$ \mathscr {E} \left[ \frac{(Y-\theta )^T Y}{\Vert Y \Vert ^2} \right] = \sum _{i=1}^N \mathscr {E} \left[ (y_i - \theta _i) \frac{y_i}{\Vert Y \Vert ^2} \right] = \sigma ^2 \sum _{i=1}^N \mathscr {E} \left[ \frac{\Vert Y \Vert ^2 - 2 y_i^2}{\Vert Y \Vert ^4} \right] = (N-2) \sigma ^2 \, \mathscr {E} \left[ \frac{1}{\Vert Y \Vert ^2} \right] . $$

Using this equality in the MSE expression, we finally obtain

$$ MSE_{JS} = N \sigma ^2 - (N-2)^2 \sigma ^4 \, \mathscr {E} \left[ \frac{1}{\Vert Y \Vert ^2} \right] < N \sigma ^2 = MSE_{LS}. $$

## Notes

- 1.
In future chapters, \(\theta _0\) will be used to denote the true value of the deterministic vector that has generated the data, distinguishing it from the vector which parametrizes the model. In this introductory chapter, \(\theta \) is instead used in both cases to keep the notation as simple as possible.

## References

Berger JO (1980) A robust generalized Bayes estimator and confidence region for a multivariate normal mean. Ann Stat 8:716–761

Berger JO (1982) Selecting a minimax estimator of a multivariate normal mean. Ann Stat 10:81–92

Berger JO (1994) An overview of robust Bayesian analysis. Test 3:5–124

Bertero M (1989) Linear inverse and ill-posed problems. Adv Electron Electron Phys 75:1–120

Bhattacharya PK (1966) Estimating the mean of a multivariate normal population with general quadratic loss function. Ann Math Stat 37:1819–1824

Bock ME (1975) Minimax estimators of the mean of a multivariate normal distribution. Ann Stat 3:209–218

Brandwein AC, Strawderman WE (2012) Stein estimation for spherically symmetric distributions: recent developments. Stat Sci 27:11–23

Casella G (1980) Minimax ridge regression estimation. Ann Stat 8:1036–1056

Casella G, Berger R (2001) Statistical inference. Cengage Learning

Efron B, Morris C (1973) Stein’s estimation rule and its competitors - an empirical Bayes approach. J Am Stat Assoc 68(341):117–130

Greenberg E, Webster CE (1983) Advanced econometrics: a bridge to the literature. Wiley

Guo YY, Pal N (1992) A sequence of improvements over the James–Stein estimator. J Multivar Anal 42:302–317

Hoerl AE (1962) Application of ridge analysis to regression problems. Chem Eng Prog 58:54–59

Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67

James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the 4th Berkeley symposium on mathematical statistics and probability, vol. I. University of California Press, pp 361–379

Kubokawa T (1991) An approach to improving the James–Stein estimator. J Multivar Anal 36:121–126

Ljung L (1999) System identification - theory for the user, 2nd edn. Prentice-Hall, Upper Saddle River

Manton JH, Krishnamurthy V, Poor HV (1998) James–Stein state filtering algorithms. IEEE Trans Signal Process 46(9):2431–2447

Maruyama Y, Strawderman WE (2005) A new class of generalized Bayes minimax ridge regression estimators. Ann Stat 1753–1770

Phillips DL (1962) A technique for the numerical solution of certain integral equations of the first kind. J Assoc Comput Mach 9:84–97

Pillonetto G, Dinuzzo F, Chen T, De Nicolao G, Ljung L (2014) Kernel methods in system identification, machine learning and function estimation: a survey. Automatica 50

Shinozaki N (1974) A note on estimating the mean vector of a multivariate normal distribution with general quadratic loss function. Keio Eng Rep 27:105–112

Stein C (1956) Inadmissibility of the usual estimator for the mean of a multivariate distribution. In: Proceedings of the 3rd Berkeley symposium on mathematical statistics and probability, vol I. University of California Press, pp 197–206

Stein C (1981) Estimation of the mean of a multivariate normal distribution. Ann Stat 9:1135–1151

Strawderman WE (1971) Proper Bayes minimax estimators of the multivariate normal mean. Ann Math Stat 42:385–388

Strawderman WE (1978) Minimax adaptive generalized ridge regression estimators. J Am Stat Assoc 73:623–627

Tikhonov AN, Arsenin VY (1977) Solutions of ill-posed problems. Winston/Wiley, Washington, D.C

Tikhonov AN (1943) On the stability of inverse problems. Dokl Akad Nauk SSSR 39:195–198

Tikhonov AN (1963) On the solution of incorrectly formulated problems and the regularization method. Dokl Akad Nauk SSSR 151:501–504


## Rights and permissions

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Copyright information

© 2022 The Author(s)

## About this chapter

### Cite this chapter

Pillonetto, G., Chen, T., Chiuso, A., De Nicolao, G., Ljung, L. (2022). Bias. In: Regularized System Identification. Communications and Control Engineering. Springer, Cham. https://doi.org/10.1007/978-3-030-95860-2_1

Print ISBN: 978-3-030-95859-6

Online ISBN: 978-3-030-95860-2
