1.1 The Stein Effect

Consider the following “basic” statistical problem. Starting from the realizations of N independent Gaussian random variables \(y_i \sim \mathscr {N}(\theta _i,\sigma ^2)\), our aim is to reconstruct the means \(\theta _i\), collected in the vector \(\theta \), seen as a deterministic but unknown parameter vector. The estimation performance will be measured in terms of the mean squared error (MSE). In particular, let \(\mathscr {E}\) and \(\Vert \cdot \Vert \) denote expectation and the Euclidean norm, respectively. Then, given an estimator \(\hat{\theta }\) of an N-dimensional vector \(\theta \) with ith component \(\theta _i\), one has

$$\begin{aligned} MSE_{\hat{\theta }}&= \mathscr {E} \Vert \hat{\theta } - \theta \Vert ^2 \nonumber \\&= \underbrace{\sum _{i=1}^N \mathscr {E} (\hat{\theta }_i - \mathscr {E} \hat{\theta }_i )^2}_{Variance} + \underbrace{\sum _{i=1}^N (\theta _i - \mathscr {E} \hat{\theta }_i )^2}_{Bias^2}, \end{aligned}$$
(1.1)

where in the last step we have decomposed the error into two components. The first is the variance of the estimator, while the second, the squared difference between the mean of the estimator and the true parameter values, measures the bias. If the mean coincides with \(\theta \), the estimator is said to be unbiased. The total error thus has two contributions: the variance and the (squared) bias.
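As a simple numerical illustration of the decomposition (1.1), the following sketch (a minimal Monte Carlo experiment; the dimension, noise level and shrinkage factor c are arbitrary choices) estimates the variance and squared-bias terms for the shrinkage estimator \(\hat{\theta } = cY\) and checks that their sum matches the empirical MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, c = 10, 1.0, 0.8                  # dimension, noise std, shrinkage factor (arbitrary)
theta = np.linspace(-1.0, 1.0, N)           # true (deterministic) mean vector

# Monte Carlo: many realizations of Y and of the estimator theta_hat = c*Y
Y = theta + sigma * rng.standard_normal((100_000, N))
theta_hat = c * Y

mse = np.mean(np.sum((theta_hat - theta) ** 2, axis=1))    # E ||theta_hat - theta||^2
variance = np.sum(np.var(theta_hat, axis=0))               # sum_i E (theta_hat_i - E theta_hat_i)^2
bias2 = np.sum((theta - theta_hat.mean(axis=0)) ** 2)      # sum_i (theta_i - E theta_hat_i)^2

print(f"MSE ~ {mse:.3f},  variance + bias^2 ~ {variance + bias2:.3f}")
```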

Note that the mean estimation problem introduced above is a simple instance of linear Gaussian regression. In fact, letting \(I_N\) be the \(N \times N\) identity matrix, the measurement model is

$$\begin{aligned} Y = \theta + E, \quad E \sim \mathscr {N}(0,\sigma ^2 I_N), \end{aligned}$$
(1.2)

where Y is the N-dimensional (column) vector with ith component \(y_i\). The most popular strategy to recover \(\theta \) from data is least squares, which also corresponds to maximum likelihood in this Gaussian scenario. The solution minimizes

$$ \Vert Y - \theta \Vert ^2 $$

and is then given by

$$ \hat{\theta }^{LS} = Y. $$

At first sight, the obtained estimator is the most reasonable one. A first intuitive argument supporting it is the fact that the random variables \(\{y_j\}_{ j \ne i }\) seem unable to carry any information on \(\theta _i\), since all the noises \(e_i\) are independent. Hence, the natural estimate of \(\theta _i\) indeed appears to be its noisy observation \(y_i\). This estimator is also unbiased: for any \(\theta \) we have

$$ \mathscr {E} \left( \hat{\theta }^{LS}\right) = \mathscr {E} \left( Y \right) = \theta . $$

Hence, from (1.1) we see that the MSE coincides with its variance, which is constant over \(\theta \) and given by

$$ MSE_{LS} = \mathscr {E} \Vert \hat{\theta }^{LS} - \theta \Vert ^2 = N\sigma ^2. $$

According to Markov’s theorem, \(\hat{\theta }^{LS}\) is also efficient. This means that its variance equals the Cramér–Rao limit: no unbiased estimator can be better than least squares, e.g., see [9, 17].

1.1.1 The James–Stein Estimator

By introducing some bias in the inference process, it is easy to obtain estimators which strictly dominate least squares (in the MSE sense) over certain parameter regions. The most trivial example is the constant estimator \(\hat{\theta } = a\). Its variance is zero, so that its MSE reduces to the bias component \(\Vert \theta -a\Vert ^2\). Hence, even if the behaviour of \(\hat{\theta }\) is unacceptable in most of the parameter space, this estimator outperforms least squares in the region

$$ \{\theta \ \text {s.t.} \ \Vert \theta -a\Vert ^2<N\sigma ^2 \}. $$

Note a feature common to least squares and the constant estimator: neither attempts to trade off bias and variance, each simply sets one of the two MSE components in (1.1) to zero. An alternative route is the design of estimators which try to balance bias and variance. Rather surprisingly, we will now see that this strategy can dominate \(\hat{\theta }^{LS}\) over the entire parameter space.

The first criticisms of least squares were raised by Stein in the 1950s [23] and can be summarized as follows. A good mean estimator \(\hat{\theta }\) should also lead to a good estimate of the Euclidean norm of \(\theta \). Thus, one should have

$$ \Vert \hat{\theta } \Vert \approx \Vert \theta \Vert . $$

But, if we consider the “natural” estimator \(\hat{\theta }^{LS}=Y\), in view of the independence of the errors \( e_i\), one obtains

$$ \mathscr {E} \Vert Y \Vert ^2 = N\sigma ^2 + \Vert \theta \Vert ^2. $$

This shows that the least squares estimator tends to overestimate \(\Vert \theta \Vert \). It thus seems desirable to correct \(\hat{\theta }^{LS}\) by shrinking the estimate towards the origin, e.g., adopting estimators of the form \(\hat{\theta }^{LS}(1-r)\), where r is a positive scalar. The most famous example is the James–Stein estimator [15] where r is determined from data as follows:

$$ r=\frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2}, $$

hence leading to

$$ \hat{\theta }^{JS} = Y - \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} Y. $$

Note that, even if all the components of Y are mutually independent, \(\hat{\theta }^{JS}\) exploits all of them to estimate each \(\theta _i\). The surprising outcome is that \(\hat{\theta }^{JS}\) outperforms \(\hat{\theta }^{LS}\) over the entire parameter space, as illustrated in the next theorem.

Theorem 1.1

(James–Stein’s MSE, based on [15]) Consider  N Gaussian and independent random variables \(y_i \sim \mathscr {N}(\theta _i,\sigma ^2)\). Let also \(\hat{\theta }^{JS}\) denote the James–Stein estimator of the means, i.e.,

$$ \hat{\theta }^{JS} = Y - \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} Y. $$

Then, if \(N \ge 3\), the MSE of \(\hat{\theta }^{JS}\) satisfies

$$ MSE_{JS} < N\sigma ^2 \quad \forall \theta . $$

We say that an estimator dominates another estimator if its MSE is never larger for any \(\theta \) and is strictly smaller for some \(\theta \). In statistics, an estimator is said to be admissible if no other estimator dominates it in terms of MSE. The above theorem thus shows that the least squares estimator of the mean of a multivariate Gaussian is not admissible if the dimension exceeds two. The reason is that, even when the Gaussians are independent, the global MSE can be reduced uniformly by adding some bias to the estimate. This is also graphically illustrated in Fig. 1.1, where \(MSE_{JS}\), along with its decomposition, is plotted as a function of the component \(\theta _1\) of the ten-dimensional vector \(\theta =[\theta _1 \ 0 \ldots 0]\) (the noise variance is equal to one). One can see that \(MSE_{JS}<MSE_{LS}\) since the bias introduced by \(\hat{\theta }^{JS}\) is compensated by a greater reduction in the variance of the estimate. Note, however, that James–Stein improves the overall MSE and not the individual errors affecting the \(\theta _i\). This aspect can be important in applications where it is not desirable to trade a higher individual MSE for a smaller overall MSE.
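The claim of Theorem 1.1 is easy to check by simulation. The sketch below (a minimal Monte Carlo experiment; the setting \(N=10\), \(\sigma =1\) and \(\theta =[\theta _1 \ 0 \ldots 0]\) mimics that of Fig. 1.1, while the seed and number of runs are arbitrary) compares the empirical MSE of least squares and James–Stein:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, runs = 10, 1.0, 200_000
for theta1 in [0.0, 2.0, 5.0]:                   # first component of theta = [theta1 0 ... 0]
    theta = np.zeros(N)
    theta[0] = theta1
    Y = theta + sigma * rng.standard_normal((runs, N))

    # least squares: theta_hat = Y
    mse_ls = np.mean(np.sum((Y - theta) ** 2, axis=1))

    # James-Stein: shrink Y towards the origin by the factor (N-2)*sigma^2 / ||Y||^2
    shrink = (N - 2) * sigma**2 / np.sum(Y**2, axis=1, keepdims=True)
    mse_js = np.mean(np.sum(((1 - shrink) * Y - theta) ** 2, axis=1))

    print(f"theta1 = {theta1}:  MSE_LS ~ {mse_ls:.2f} (theory {N * sigma**2:.0f}),  MSE_JS ~ {mse_js:.2f}")
```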

Fig. 1.1 Estimation of the mean \(\theta \in {\mathbb R}^{10}\) of a Gaussian with covariance equal to the identity matrix. The plot displays the mean squared error of least squares (\(MSE_{LS}\)) and of the James–Stein estimator (\(MSE_{JS}\)), including its bias-variance decomposition, as a function of \(\theta _1\) with \(\theta =[\theta _1 \ 0 \ldots 0]\)

It is easy to check that the James–Stein estimator admits the following interesting reformulation:

$$\begin{aligned} \begin{aligned} \hat{\theta }^{JS}&= \arg \min _\theta \Vert Y- \theta \Vert ^2 + \gamma \ \Vert \theta \Vert ^2\\&= Y \frac{1}{1 + \gamma }, \end{aligned} \end{aligned}$$
(1.3)

where the positive scalar \(\gamma \) is determined from data as follows:

$$\begin{aligned} \gamma = \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2-(N-2)\sigma ^2}. \end{aligned}$$
(1.4)

Equation (1.3) thus reveals that \(\hat{\theta }^{JS}\) is a particular version of regularized least squares, an estimator which will play a central role in this book. In particular, the objective in (1.3) contains two contrasting terms. The first one, \(\Vert Y- \theta \Vert ^2\), is a quadratic loss which measures the adherence to experimental data. The second one, \(\Vert \theta \Vert ^2\), is a regularizer which shrinks the estimate towards the origin by penalizing the energy of the solution. The role of the regularization parameter \(\gamma \) is then to balance these two components via a simple scalar adjustment. Equation (1.4) shows that James–Stein’s strategy is to set its value to the inverse of an estimate of the signal-to-noise ratio.
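The equivalence between the two forms of \(\hat{\theta }^{JS}\) in (1.3)–(1.4) can also be verified numerically; a minimal sketch (the data vector is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 10, 1.0
Y = 3.0 + rng.standard_normal(N)                            # an arbitrary data vector

theta_js = Y - (N - 2) * sigma**2 / np.sum(Y**2) * Y        # shrinkage form of James-Stein

gamma = (N - 2) * sigma**2 / (np.sum(Y**2) - (N - 2) * sigma**2)   # Eq. (1.4)
theta_reg = Y / (1 + gamma)                                  # regularized least squares form (1.3)

print(np.allclose(theta_js, theta_reg))                      # True: the two forms coincide
```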

1.1.2 Extensions of the James–Stein Estimator \(\star \)

We have seen that the James–Stein estimator corrects each component of \(\hat{\theta }^{LS}\) by shifting it towards the origin. This implies that the MSE improvement is larger when the components of \(\theta \) are close to zero. Actually, there is nothing special about the origin. If the true \(\theta \) is expected to be close to \(a \in {\mathbb R}^N\), one can modify the original \(\hat{\theta }^{JS}\) as follows:

$$ \hat{\theta }^{JS} = Y - \frac{(N-2)\sigma ^2}{\Vert Y - a \Vert ^2} \left( Y - a\right) . $$

The result is an estimator which still dominates least squares, with the origin’s role now played by a. The estimator thus concentrates the MSE improvement around a.
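In code, this extension only requires shrinking towards a instead of towards the origin. The sketch below (an illustrative Monte Carlo run; the centre a, the spread of \(\theta \) around it and the seed are arbitrary choices) shows how the MSE improvement concentrates around a:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, runs = 10, 1.0, 100_000
a = 5.0 * np.ones(N)                          # prior guess for the location of theta
theta = a + 0.3 * rng.standard_normal(N)      # true means close to a
Y = theta + sigma * rng.standard_normal((runs, N))

def js(Y, centre):
    """James-Stein shrinkage of Y towards the given centre."""
    d = Y - centre
    return Y - (N - 2) * sigma**2 / np.sum(d**2, axis=1, keepdims=True) * d

for name, est in [("LS", Y), ("JS towards 0", js(Y, 0.0)), ("JS towards a", js(Y, a))]:
    print(f"{name:12s}  MSE ~ {np.mean(np.sum((est - theta) ** 2, axis=1)):.2f}")
```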

Now, let us consider a non-orthonormal scenario where Gaussian linear regression amounts to estimating \(\theta \) from the N measurements

$$ y_i = d_i \theta _i + e_i, \quad e_i \sim \mathscr {N}(0,1), $$

with all the noises \(e_i\) mutually independent. The least squares (maximum likelihood) estimator is now

$$ \hat{\theta }^{LS}_i = \frac{y_i}{d_i}, \quad i=1,\ldots ,N, $$

and its MSE is the sum of the variances of \(\hat{\theta }^{LS}_i\), i.e.,

$$ MSE_{LS} = \sum _{i=1}^N \frac{1}{d_i^2}. $$

Note that the MSE can be large when just one of the \(d_i\) is small. In this case, the problem is said to be ill-conditioned: even a moderate measurement error can lead to a large reconstruction error.

Also in this non-orthonormal scenario, it is possible to design estimators whose MSE is uniformly smaller than \(MSE_{LS}\). The number of possible choices is huge, depending on which region of the parameter space one wants to concentrate the improvement on. There is, however, an important limitation shared by all Stein-type estimators: in general, they are not very effective against ill-conditioning. The following example illustrates an estimator whose negative features are representative of the drawbacks of Stein-type estimation in non-orthogonal settings.

Example 1.2

(A generalization of James–Stein) Consider the estimator \(\hat{\theta }\) whose ith component is given by

$$\begin{aligned} \hat{\theta }_i = \left[ 1 - \frac{N-2}{S}d_i^2 \right] \frac{y_i}{d_i}, \quad i=1,\ldots ,N, \end{aligned}$$
(1.5)

where

$$ S = \sum _{i=1}^N d_i^2 y_i^2. $$

It is now shown that \(\hat{\theta }\) is a generalization of James–Stein able to outperform least squares over the entire parameter space. In fact, defining

$$ h_i(Y) = -d_i^2 \frac{N-2}{S}y_i, $$

after simple computations we obtain

$$\begin{aligned} MSE_{\hat{\theta }}&= \sum _{i=1}^N \frac{1}{d_i^2} + \mathscr {E} \left[ 2 \sum _{i=1}^N \frac{(y_i -d_i \theta _i)h_i(Y)}{d_i^2} + \sum _{i=1}^N \frac{h_i^2(Y)}{d_i^2} \right] \\&= \sum _{i=1}^N \frac{1}{d_i^2} + \mathscr {E} \left[ 2 \sum _{i=1}^N \frac{1}{d_i^2} \frac{\partial h_i (Y)}{\partial y_i} + \sum _{i=1}^N \frac{h_i^2(Y)}{d_i^2} \right] , \end{aligned}$$

where the last equality comes from Lemma 1.1 reported in Sect. 1.4. Since

$$ \frac{\partial h_i (Y)}{\partial y_i} = -\frac{(N-2)d_i^2}{S} + \frac{2(N-2)d_i^4 y_i^2}{S^2}, $$

one has

$$\begin{aligned}&\mathscr {E} \left[ 2 \sum _{i=1}^N \frac{1}{d_i^2} \frac{\partial h_i (Y)}{\partial y_i} + \sum _{i=1}^N \frac{h_i^2(Y)}{d_i^2} \right] \\&\qquad = \mathscr {E} \left[ - \frac{2(N-2)N}{S} + \frac{2(N-2)}{S^2} \sum _{i=1}^N 2d_i^2 y_i^2 + \frac{(N-2)^2}{S^2} \sum _{i=1}^N d_i^2 y_i^2 \right] \\&\qquad = - \mathscr {E} \frac{(N-2)^2}{S} < 0 \end{aligned}$$

which implies

$$ MSE_{\hat{\theta }} < MSE_{\hat{\theta }^{LS}} \ \ \forall \theta . $$

However, assume that the problem is ill-conditioned. Then, if one \(d_i\) is small and the values of the \(d_i\) are quite spread out, we could well have \(d_i^2 /S \approx 0\). Hence, (1.5) essentially reduces to

$$ \hat{\theta }_i = \left[ 1 - \frac{N-2}{S}d_i^2 \right] \frac{y_i}{d_i} \approx \frac{y_i}{d_i}, $$

which is the least squares estimate of \(\theta _i\). This means that the signal components mostly influenced by the noise, i.e., those associated with small \(d_i\), are not regularized. Thus, in the presence of ill-conditioning, \(\hat{\theta }\) will likely return an estimate affected by large errors.    \(\square \)
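The effect can be observed numerically. In the sketch below (an illustrative example; the particular values of the \(d_i\) are arbitrary), estimator (1.5) is applied to an ill-conditioned problem: the component associated with the small \(d_i\) is left essentially unregularized and inherits the large error of least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
d = np.array([0.05, 1.0, 2.0, 3.0, 4.0])        # one small d_i: ill-conditioned problem
N = d.size
theta = np.ones(N)
y = d * theta + rng.standard_normal(N)           # y_i = d_i*theta_i + e_i, e_i ~ N(0,1)

S = np.sum(d**2 * y**2)
shrink = 1 - (N - 2) * d**2 / S                  # component-wise factor in (1.5)
theta_hat = shrink * y / d

print("shrinkage factors :", np.round(shrink, 3))     # ~1 for the small d_i: no regularization
print("least squares     :", np.round(y / d, 2))
print("estimate (1.5)    :", np.round(theta_hat, 2))
```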

1.2 Ridge Regression

Consider now one of the fundamental problems in system identification. The task is to estimate the impulse response \(g^0\) of a discrete-time, linear and causal dynamic system, starting from noisy output data. The measurement model is

$$\begin{aligned} y(t) = \sum _{k=1}^{\infty } g_k^0 u(t-k) + e(t), \ \ t=1,\ldots ,N, \end{aligned}$$
(1.6)

where t denotes time, the sampling interval is one time unit for simplicity, the \(g_k^0\) are the impulse response coefficients, u(t) is the known system input and e(t) is the noise.

To determine the impulse response from input–output measurements, one of the main questions is how to parametrize the unknown \(g^0\). The classical approach, which will also be reviewed in the next chapter, introduces a collection of impulse response models \(g(\theta )\), each parametrized by a different vector \(\theta \). In particular, here we adopt an FIR model of order m, i.e., \(g_k(\theta )=\theta _k\) for \(k=1, \ldots ,m\) and zero elsewhere. This permits reformulating (1.6) as a linear regression: stacking the elements y(t) and e(t) into the vectors Y and E, we obtain the model

$$ Y= \varPhi \theta + E $$

with the regression matrix \(\varPhi \in {\mathbb R}^{N \times m}\) given by

$$\small \varPhi = \left( \begin{array}{ccccc} u(0) & u(-1) & u(-2) & \ldots & u(-m+1) \\ u(1) & u(0) & u(-1) & \ldots & u(-m+2) \\ \vdots & \vdots & \vdots & & \vdots \\ u(N-1) & u(N-2) & u(N-3) & \ldots & u(N-m) \end{array}\right) . $$

We can now use least squares to estimate \(\theta \). Assuming that \(\varPhi ^T\varPhi \) has full rank, we obtain

$$\begin{aligned} \hat{\theta }^{LS}&={{\,\mathrm{arg\,min}\,}}_\theta \Vert Y-\varPhi \theta \Vert ^2\end{aligned}$$
(1.7a)
$$\begin{aligned}&=(\varPhi ^T\varPhi )^{-1}\varPhi ^TY. \end{aligned}$$
(1.7b)

Note that the impulse response estimate is a function of the FIR order, which corresponds to the dimension m of \(\theta \). The choice of m is a trade-off between bias (a large m is needed to describe slowly decaying impulse responses without too much error) and variance (a large m requires estimating many parameters, leading to a large variance). This can be illustrated with a numerical experiment. The unknown impulse response \(g^0\) is defined by the following rational transfer function:

$$ \frac{(z+1)^2}{z(z-0.8)(z-0.7)}, $$

which, in practice, is equal to zero after less than 50 samples (\(g^0\) is the red line in Fig. 1.3). We estimate the system from 1000 outputs corrupted by white Gaussian noise e(t) with variance equal to the variance of the noiseless output divided by 50, see Fig. 1.2 (bottom panel). Data come from the system initially at rest and then fed at \(t=0\) with white noise low-pass filtered by \(z/(z-0.99)\), see Fig. 1.2 (top panel). The reconstruction error is very large if we try to estimate \(g^0\) with \(m=50\): linear models are easy to estimate, but the drawback is that high-order FIR models may suffer from high variance. Hence, it is important to select a model order which well balances bias and variance. To do that, one needs to try different values of m and then use some validation procedure to determine the “optimal” one. In this case, since the true \(g^0\) is known, we can obtain the best value by selecting the \(m \in [1,\ldots ,50]\) which minimizes the MSE. This is an example of an oracle-based procedure not implementable in practice: the optimal order is selected exploiting knowledge of the true system. We obtain \(m=18\), which corresponds to \(MSE_{LS}=70.7\) and leads to the impulse response estimate displayed in Fig. 1.3. Even if the data set size is large and the signal-to-noise ratio is good, the estimate is far from satisfactory. The reason is that the low-pass input has poor excitation and leads to an ill-conditioned problem. This means that the condition number of the regression matrix \(\varPhi \) is large, so that even a small output error can produce a large reconstruction error.
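A sketch of this experiment is reported below (using scipy for filtering; the seed is arbitrary, \(g^0\) is truncated after 50 samples and the error is computed on a single data realization, so the numbers will not exactly reproduce \(m=18\) and \(MSE_{LS}=70.7\)):

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
N, n_g = 1000, 50                                # data size, impulse response length considered

# true impulse response of (z+1)^2 / (z(z-0.8)(z-0.7)), truncated after n_g samples
g0 = lfilter([0, 1, 2, 1], [1, -1.5, 0.56], np.r_[1.0, np.zeros(n_g - 1)])

# input: white noise low-pass filtered by z/(z-0.99); the system is at rest before t=0
u = lfilter([1.0], [1.0, -0.99], rng.standard_normal(N))

# noiseless output, then measurements with noise variance = var(noiseless output)/50
Phi_full = toeplitz(u, np.r_[u[0], np.zeros(n_g - 1)])      # N x 50 regression matrix
y_clean = Phi_full @ g0
sigma2 = np.var(y_clean) / 50
Y = y_clean + np.sqrt(sigma2) * rng.standard_normal(N)

# least squares FIR estimates for m = 1,...,50; "oracle" order minimizing the true error
best_err, best_m = np.inf, None
for m in range(1, n_g + 1):
    theta_ls, *_ = np.linalg.lstsq(Phi_full[:, :m], Y, rcond=None)
    err = np.sum((np.r_[theta_ls, np.zeros(n_g - m)] - g0) ** 2)
    if err < best_err:
        best_err, best_m = err, m

print(f"oracle FIR order m = {best_m},  squared error = {best_err:.1f}")
```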

Fig. 1.2 Input–output data

Fig. 1.3 True impulse response \(g^0\) (thick red line) and least squares estimate

An alternative to the classical paradigm, where different model structures are introduced, is the following straightforward generalization of (1.3), known as ridge regression [13, 14]:

$$\begin{aligned} \hat{\theta }^{R}&={{\,\mathrm{arg\,min}\,}}_\theta \Vert Y-\varPhi \theta \Vert ^2 + \gamma \Vert \theta \Vert ^2\end{aligned}$$
(1.8a)
$$\begin{aligned}&=(\varPhi ^T\varPhi +\gamma I_{m})^{-1}\varPhi ^TY, \end{aligned}$$
(1.8b)

where we set \(m=50\) to solve our problem. Letting \(A=(\varPhi ^T\varPhi +\gamma I_{m})^{-1}\varPhi ^T\), it is easy to derive the MSE decomposition associated with \(\hat{\theta }^{R}\):

$$\begin{aligned} MSE_{R} = \underbrace{\sigma ^2 \text{ Trace } (AA^T)}_{Variance} + \underbrace{\Vert \theta - A \varPhi \theta \Vert ^2}_{Bias^2}. \end{aligned}$$
(1.9)

Figure 1.4 displays \(MSE_{R}\) for the particular system identification problem at hand as a function of the regularization parameter. Note that \(\gamma \) plays the role of the model order in the classical scenario but can be tuned in a continuous manner to reach a good bias-variance trade-off. It is also interesting to see its influence on the variance and bias components. The variance is a decreasing function of the regularization parameter. Hence, its maximum is reached for \(\gamma =0\), where \(\hat{\theta }^{R}\) reduces to the least squares estimator \(\hat{\theta }^{LS}\) given by (1.7) with \(m=50\). Instead, the bias increases with \(\gamma \). In the limit \(\gamma \rightarrow \infty \), the penalty \(\Vert \theta \Vert ^2\) is so overweighted that \(\hat{\theta }^{R}\) becomes the constant estimator centred at the origin (it returns all null impulse response coefficients).

In Fig. 1.5, we finally display the ridge regularized estimate with \(\gamma \) set to the value minimizing the error and leading to \(MSE_{R}=16.8\). It is evident that ridge regression provides a much better bias-variance trade-off than selecting the FIR order.
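The corresponding ridge experiment can be sketched as follows (same data-generating mechanism as in the previous sketch, with an arbitrary seed and a grid search for the oracle \(\gamma \); again the numbers are only indicative and will differ from \(MSE_{R}=16.8\)). The last lines also evaluate the decomposition (1.9) at the selected \(\gamma \):

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
N, m = 1000, 50

# same data-generating mechanism as in the least squares sketch above
g0 = lfilter([0, 1, 2, 1], [1, -1.5, 0.56], np.r_[1.0, np.zeros(m - 1)])
u = lfilter([1.0], [1.0, -0.99], rng.standard_normal(N))
Phi = toeplitz(u, np.r_[u[0], np.zeros(m - 1)])
y_clean = Phi @ g0
sigma2 = np.var(y_clean) / 50
Y = y_clean + np.sqrt(sigma2) * rng.standard_normal(N)

# ridge estimates (1.8b) over a grid of regularization parameters
PtP, PtY = Phi.T @ Phi, Phi.T @ Y
best_err, best_gamma = np.inf, None
for gamma in np.logspace(-2, 6, 200):
    theta_r = np.linalg.solve(PtP + gamma * np.eye(m), PtY)
    err = np.sum((theta_r - g0) ** 2)
    if err < best_err:
        best_err, best_gamma = err, gamma

theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)           # gamma = 0: least squares, m = 50
print(f"LS (m=50) squared error : {np.sum((theta_ls - g0) ** 2):.1f}")
print(f"ridge squared error     : {best_err:.1f}  at gamma = {best_gamma:.3g}")

# bias-variance decomposition (1.9) at the selected gamma
A = np.linalg.solve(PtP + best_gamma * np.eye(m), Phi.T)
print(f"variance = {sigma2 * np.trace(A @ A.T):.1f},  squared bias = {np.sum((g0 - A @ Phi @ g0) ** 2):.1f}")
```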

Fig. 1.4 MSE of ridge regression and its bias-variance decomposition as a function of the regularization parameter

Fig. 1.5 True impulse response \(g^0\) (thick red line) and ridge regularized estimate

1.3 Further Topics and Advanced Reading

Stein’s intuition on the development of an estimator able to dominate least squares in terms of global MSE can be found in [23], while the specific shape of \(\hat{\theta }^{JS}\) was obtained in [15]. Since then, a large variety of estimators outperforming least squares, also under different losses, have been designed. It has been proved that there exist estimators which dominate James–Stein, even if the MSE improvement is not large, as described in [12, 16, 25]. Extensions and applications can be found in [5, 6, 11, 22, 24, 26]. A James–Stein version of the Kalman filter is derived in [18]. For interesting discussions on the limitations of Stein-type estimators in facing ill-conditioning see [8], but also [19] for new outcomes with better numerical stability properties. Other developments are reported in [7], where generalizations of Stein’s lemma are also described.

The paper [10] describes connections between James–Stein estimation and the so-called empirical Bayes approaches which will be treated later on in this book. The interplay between Stein-type estimators and the Bayes approach is also discussed in [2], where one can also find an estimator which dominates least squares, concentrating the MSE improvement in an ellipsoid in the parameter space that can be chosen by the user. This approach is deeply connected with robust Bayesian estimation concepts, e.g., see [1, 3].

The term ridge regression was popularized by the works [13, 14]. This approach, introduced to guard against ill-conditioning and numerical instability, is an example of Tikhonov regularization for ill-posed problems. Among the first classical works on regularization and inverse problems, it is worth citing [4, 20, 27, 28, 29]. A recent survey on the use of regularization for system identification can instead be found in [21]. The literature on this topic is huge and other relevant works will be cited in the next chapters.

1.4 Appendix: Proof of Theorem 1.1

To discuss the properties of the James–Stein estimator, first it is useful to introduce a result which is a simplified version of Lemma 3.2 reported in Chap. 3, known as Stein’s lemma.

Lemma 1.1

(Stein’s lemma, based on [24]) Consider N Gaussian and independent random variables \(y_i \sim \mathscr {N}(\theta _i,\sigma ^2)\). Let also \(h:{\mathbb R}^N \rightarrow {\mathbb R}\) be a differentiable function such that \(\mathscr {E} \left| \frac{\partial h (Y) }{\partial y_i} \right| < \infty \) for \(i=1,\ldots ,N\). Then, it holds that

$$ \mathscr {E} ( y_i - \theta _i ) h(Y) = \sigma ^2 \mathscr {E} \frac{\partial h (Y)}{\partial y_i}. $$

Proof

During the proof, we use \(\mathscr {E}_{j \ne i}\) to denote the expectation conditional on \(\{y_j\}_{ j \ne i}\). Also, with a slight abuse of notation, h(x) with \(x \in {\mathbb R}\) indicates the function h with its ith argument set to x and the other arguments set to \(y_j, \ j \ne i\).

Note that, in view of the independence assumptions, each \(y_i\) conditional on \(\{y_j\}_{ j \ne i}\) is still Gaussian with mean \(\theta _i\) and variance \(\sigma ^2\). Then, using integration by parts, one has

$$\begin{aligned} \mathscr {E}_{j \ne i} \left( \frac{\partial h (Y) }{\partial y_i} \right)&= \int _{-\infty }^{+\infty } \frac{\partial h(x)}{\partial x} \frac{\exp (-(x-\theta _i)^2/(2\sigma ^2))}{\sqrt{2 \pi }\sigma } dx\\&= \left[ h(x) \frac{\exp (-(x-\theta _i)^2/(2\sigma ^2))}{\sqrt{2 \pi } \sigma } \right] _{-\infty }^{+\infty } + \int _{-\infty }^{+\infty } \frac{(x-\theta _i)}{\sigma ^2} h(x) \frac{\exp (-(x-\theta _i)^2/(2\sigma ^2))}{\sqrt{2 \pi } \sigma } dx \\&= \int _{-\infty }^{+\infty } \frac{(x-\theta _i)}{\sigma ^2} h(x) \frac{\exp (-(x-\theta _i)^2/(2\sigma ^2))}{\sqrt{2 \pi }\sigma } dx \\&= \frac{\mathscr {E}_{j \ne i} \left( (y_i-\theta _i) h(Y) \right) }{\sigma ^2}. \end{aligned}$$

Note that the penultimate equality exploits the fact that \(h(x) \exp (-(x-\theta _i)^2/(2\sigma ^2))\) must vanish as \(x \rightarrow \pm \infty \), otherwise the assumption \(\mathscr {E} \left| \frac{\partial h (Y) }{\partial y_i} \right| < \infty \) would not hold. Using the above result, we obtain

$$\begin{aligned} \mathscr {E} \left( ( y_i - \theta _i) h( Y) \right)&= \mathscr {E} \left[ \mathscr {E}_{j \ne i} \left( ( y_i - \theta _i) h( Y ) \right) \right] \\&= \sigma ^2 \mathscr {E} \left[ \mathscr {E}_{j \ne i} \left( \frac{\partial h (Y) }{\partial y_i} \right) \right] \\&= \sigma ^2 \mathscr {E} \frac{\partial h (Y) }{\partial y_i} \end{aligned}$$

and this completes the proof.    \(\square \)
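Stein’s lemma can also be checked by simulation. The sketch below (with arbitrary \(\theta \), \(\sigma \) and seed) uses \(h(Y) = y_1/\Vert Y \Vert ^2\), the function employed in the proof of Theorem 1.1, and compares the two sides of the identity by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, runs = 5, 1.5, 1_000_000
theta = np.array([1.0, -0.5, 2.0, 0.0, 0.3])
Y = theta + sigma * rng.standard_normal((runs, N))

# h(Y) = y_1 / ||Y||^2, the function used (for each i) in the proof of Theorem 1.1
norm2 = np.sum(Y**2, axis=1)
h = Y[:, 0] / norm2
dh = 1 / norm2 - 2 * Y[:, 0] ** 2 / norm2**2      # partial derivative of h w.r.t. y_1

lhs = np.mean((Y[:, 0] - theta[0]) * h)           # E[(y_1 - theta_1) h(Y)]
rhs = sigma**2 * np.mean(dh)                      # sigma^2 E[dh/dy_1]
print(f"E[(y_1 - theta_1) h(Y)] ~ {lhs:.5f},  sigma^2 E[dh/dy_1] ~ {rhs:.5f}")
```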

We now show that the MSE of the James–Stein estimator is uniformly smaller than the MSE of least squares. One has

$$\begin{aligned} MSE_{JS}&= \mathscr {E} \left( \Vert \theta - \hat{\theta }^{JS}(Y) \Vert ^2 \right) \\&= \mathscr {E} \left( \Vert \theta - Y + \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} Y \Vert ^2 \right) \\&= \mathscr {E} \left( \Vert \theta - Y \Vert ^2 +\frac{(N-2)^2\sigma ^4}{\Vert Y \Vert ^4} \Vert Y \Vert ^2 + 2(\theta - Y)^T Y\frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} \right) \\&= N\sigma ^2 + \mathscr {E} \left( \frac{(N-2)^2\sigma ^4}{\Vert Y \Vert ^2} + 2(\theta - Y)^T Y\frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} \right) . \end{aligned}$$

As for the last term inside the expectation, exploiting Stein’s lemma with

$$ h_i(Y ) = \frac{y_i}{ \Vert Y \Vert ^2}, \quad \frac{\partial h_i (Y) }{\partial y_i} = \frac{1}{\Vert Y \Vert ^2} - 2 \frac{y_i^2}{\Vert Y \Vert ^4}, $$

one has

$$\begin{aligned} \mathscr {E} \left( \frac{(\theta - Y)^T Y }{\Vert Y \Vert ^2} \right)&= \mathscr {E} \left( \sum _{i=1}^N (\theta _i - y_i) h_i(Y ) \right) \\&= - \sigma ^2 \mathscr {E} \left( \sum _{i=1}^N \left( \frac{1}{\Vert Y \Vert ^2} - 2 \frac{y_i^2}{\Vert Y \Vert ^4} \right) \right) \\&= - \sigma ^2 \mathscr {E} \left( \frac{N-2}{\Vert Y \Vert ^2} \right) . \end{aligned}$$

Using this equality in the MSE expression, we finally obtain

$$\begin{aligned} MSE_{JS}&= N\sigma ^2 + \mathscr {E} \left( \frac{(N-2)^2\sigma ^4}{\Vert Y \Vert ^2} -2 \frac{(N-2)^2\sigma ^4}{\Vert Y \Vert ^2} \right) \\&= N\sigma ^2 - (N-2)^2\sigma ^4 \mathscr {E} \left( \frac{1}{\Vert Y \Vert ^2} \right) < N\sigma ^2. \end{aligned}$$