1.1 The Stein Effect

Consider the following “basic” statistical problem. Starting from the realizations of N independent Gaussian random variables \(y_i \sim \mathscr {N}(\theta _i,\sigma ^2)\), our aim is to reconstruct the means \(\theta _i\), collected in the vector \(\theta \), seen as a deterministic but unknown parameter vector. The estimation performance will be measured in terms of the mean squared error (MSE). In particular, let \(\mathscr {E}\) and \(\Vert \cdot \Vert \) denote expectation and the Euclidean norm, respectively. Then, given an estimator \(\hat{\theta }\) of an N-dimensional vector \(\theta \) with ith component \(\theta _i\), one has

$$\begin{aligned} MSE_{\hat{\theta }}&= \mathscr {E} \Vert \hat{\theta } - \theta \Vert ^2 \nonumber \\&= \underbrace{\sum _{i=1}^N \mathscr {E} (\hat{\theta }_i - \mathscr {E} \hat{\theta }_i )^2}_{Variance} + \underbrace{\sum _{i=1}^N (\theta _i - \mathscr {E} \hat{\theta }_i )^2}_{Bias^2}, \end{aligned}$$
(1.1)

where in the last step we have decomposed the error into two components. The first is the variance of the estimator, while the second, the squared difference between the mean of the estimator and the true parameter values, measures the bias. If the mean coincides with \(\theta \), the estimator is said to be unbiased. The total error thus has two contributions: the variance and the (squared) bias.
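As a simple numerical illustration of the decomposition (1.1), the following sketch (a minimal Monte Carlo experiment; the dimension, noise level and shrinkage factor c are arbitrary choices) estimates the variance and squared-bias terms for the shrinkage estimator \(\hat{\theta } = cY\) and checks that their sum matches the empirical MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, c = 10, 1.0, 0.8                  # dimension, noise std, shrinkage factor (arbitrary)
theta = np.linspace(-1.0, 1.0, N)           # true (deterministic) mean vector

# Monte Carlo: many realizations of Y and of the estimator theta_hat = c*Y
Y = theta + sigma * rng.standard_normal((100_000, N))
theta_hat = c * Y

mse = np.mean(np.sum((theta_hat - theta) ** 2, axis=1))    # E ||theta_hat - theta||^2
variance = np.sum(np.var(theta_hat, axis=0))               # sum_i E (theta_hat_i - E theta_hat_i)^2
bias2 = np.sum((theta - theta_hat.mean(axis=0)) ** 2)      # sum_i (theta_i - E theta_hat_i)^2

print(f"MSE ~ {mse:.3f},  variance + bias^2 ~ {variance + bias2:.3f}")
```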

Note that the mean estimation problem introduced above is a simple instance of linear Gaussian regression. In fact, letting \(I_N\) be the \(N \times N\) identity matrix, the measurement model is

$$\begin{aligned} Y = \theta + E, \quad E \sim \mathscr {N}(0,\sigma ^2 I_N), \end{aligned}$$
(1.2)

where Y is the N-dimensional (column) vector with ith component \(y_i\). The most popular strategy to recover \(\theta \) from data is least squares, which also corresponds to maximum likelihood in this Gaussian scenario. The solution minimizes

$$ \Vert Y - \theta \Vert ^2 $$

and is then given by

$$ \hat{\theta }^{LS} = Y. $$

At first sight, the obtained estimator is the most reasonable one. A first intuitive argument supporting it is the fact that the random variables \(\{y_j\}_{ j \ne i }\) seem unable to carry any information on \(\theta _i\), since all the noises \(e_i\) are independent. Hence, the natural estimate of \(\theta _i\) indeed appears to be its noisy observation \(y_i\). This estimator is also unbiased: for any \(\theta \) we have

$$ \mathscr {E} \left( \hat{\theta }^{LS}\right) = \mathscr {E} \left( Y \right) = \theta . $$

Hence, from (1.1) we see that the MSE coincides with its variance, which is constant over \(\theta \) and given by

$$ MSE_{LS} = \mathscr {E} \Vert \hat{\theta }^{LS} - \theta \Vert ^2 = N\sigma ^2. $$

According to Markov’s theorem, \(\hat{\theta }^{LS}\) is also efficient. This means that its variance equals the Cramér–Rao limit: no unbiased estimator can be better than least squares, e.g., see [9, 17].

1.1.1 The James–Stein Estimator

By introducing some bias in the inference process, it is easy to obtain estimators which strictly dominate least squares (in the MSE sense) over certain parameter regions. The most trivial example is the constant estimator \(\hat{\theta } = a\). Its variance is zero, so that its MSE reduces to the bias component \(\Vert \theta -a\Vert ^2\). Hence, even if the behaviour of \(\hat{\theta }\) is unacceptable in most of the parameter space, this estimator outperforms least squares in the region

$$ \{\theta \ \text {s.t.} \ \Vert \theta -a\Vert ^2<N\sigma ^2 \}. $$

Note a feature common to least squares and the constant estimator: neither attempts to trade off bias and variance, each simply sets one of the two MSE components in (1.1) to zero. An alternative route is the design of estimators which try to balance bias and variance. Rather surprisingly, we will now see that this strategy can dominate \(\hat{\theta }^{LS}\) over the entire parameter space.

The first criticisms of least squares were raised by Stein in the 1950s [23] and can be summarized as follows. A good mean estimator \(\hat{\theta }\) should also lead to a good estimate of the Euclidean norm of \(\theta \). Thus, one should have

$$ \Vert \hat{\theta } \Vert \approx \Vert \theta \Vert . $$

But, if we consider the “natural” estimator \(\hat{\theta }^{LS}=Y\), in view of the independence of the errors \( e_i\), one obtains

$$ \mathscr {E} \Vert Y \Vert ^2 = N\sigma ^2 + \Vert \theta \Vert ^2. $$

This shows that the least squares estimator tends to overestimate \(\Vert \theta \Vert \). It thus seems desirable to correct \(\hat{\theta }^{LS}\) by shrinking the estimate towards the origin, e.g., adopting estimators of the form \(\hat{\theta }^{LS}(1-r)\), where r is a positive scalar. The most famous example is the James–Stein estimator [15] where r is determined from data as follows:

$$ r=\frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2}, $$

hence leading to

$$ \hat{\theta }^{JS} = Y - \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} Y. $$

Note that, even if all the components of Y are mutually independent, \(\hat{\theta }^{JS}\) exploits all of them to estimate each \(\theta _i\). The surprising outcome is that \(\hat{\theta }^{JS}\) outperforms \(\hat{\theta }^{LS}\) over the entire parameter space, as illustrated in the next theorem.

Theorem 1.1

(James–Stein’s MSE, based on [15]) Consider  N Gaussian and independent random variables \(y_i \sim \mathscr {N}(\theta _i,\sigma ^2)\). Let also \(\hat{\theta }^{JS}\) denote the James–Stein estimator of the means, i.e.,

$$ \hat{\theta }^{JS} = Y - \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} Y. $$

Then, if \(N \ge 3\), the MSE of \(\hat{\theta }^{JS}\) satisfies

$$ MSE_{JS} < N\sigma ^2 \quad \forall \theta . $$

We say that an estimator dominates another estimator if its MSE is never larger for any \(\theta \) and is strictly smaller for some \(\theta \). In statistics, an estimator is said to be admissible if no other estimator dominates it in terms of MSE. The above theorem thus shows that the least squares estimator of the mean of a multivariate Gaussian is not admissible if the dimension exceeds two. The reason is that, even when the Gaussians are independent, the global MSE can be reduced uniformly by adding some bias to the estimate. This is also graphically illustrated in Fig. 1.1, where \(MSE_{JS}\), along with its decomposition, is plotted as a function of the component \(\theta _1\) of the ten-dimensional vector \(\theta =[\theta _1 \ 0 \ldots 0]\) (the noise variance is equal to one). One can see that \(MSE_{JS}<MSE_{LS}\) since the bias introduced by \(\hat{\theta }^{JS}\) is compensated by a greater reduction in the variance of the estimate. Note, however, that James–Stein improves the overall MSE and not the individual errors affecting the \(\theta _i\). This aspect can be important in applications where it is not desirable to trade a higher individual MSE for a smaller overall MSE.
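The claim of Theorem 1.1 is easy to check by simulation. The sketch below (a minimal Monte Carlo experiment; the setting \(N=10\), \(\sigma =1\) and \(\theta =[\theta _1 \ 0 \ldots 0]\) mimics that of Fig. 1.1, while the seed and number of runs are arbitrary) compares the empirical MSE of least squares and James–Stein:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, runs = 10, 1.0, 200_000
for theta1 in [0.0, 2.0, 5.0]:                   # first component of theta = [theta1 0 ... 0]
    theta = np.zeros(N)
    theta[0] = theta1
    Y = theta + sigma * rng.standard_normal((runs, N))

    # least squares: theta_hat = Y
    mse_ls = np.mean(np.sum((Y - theta) ** 2, axis=1))

    # James-Stein: shrink Y towards the origin by the factor (N-2)*sigma^2 / ||Y||^2
    shrink = (N - 2) * sigma**2 / np.sum(Y**2, axis=1, keepdims=True)
    mse_js = np.mean(np.sum(((1 - shrink) * Y - theta) ** 2, axis=1))

    print(f"theta1 = {theta1}:  MSE_LS ~ {mse_ls:.2f} (theory {N * sigma**2:.0f}),  MSE_JS ~ {mse_js:.2f}")
```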

Fig. 1.1 Estimation of the mean \(\theta \in {\mathbb R}^{10}\) of a Gaussian with covariance equal to the identity matrix. The plot displays the mean squared error of least squares (\(MSE_{LS}\)) and of the James–Stein estimator (\(MSE_{JS}\)), including its bias-variance decomposition, as a function of \(\theta _1\) with \(\theta =[\theta _1 \ 0 \ldots 0]\)

It is easy to check that the James–Stein estimator admits the following interesting reformulation:

$$\begin{aligned} \begin{aligned} \hat{\theta }^{JS}&= \arg \min _\theta \Vert Y- \theta \Vert ^2 + \gamma \ \Vert \theta \Vert ^2\\&= Y \frac{1}{1 + \gamma }, \end{aligned} \end{aligned}$$
(1.3)

where the positive scalar \(\gamma \) is determined from data as follows:

$$\begin{aligned} \gamma = \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2-(N-2)\sigma ^2}. \end{aligned}$$
(1.4)

Equation (1.3) thus reveals that \(\hat{\theta }^{JS}\) is a particular version of regularized least squares, an estimator which will play a central role in this book. In particular, the objective in (1.3) contains two contrasting terms. The first one, \(\Vert Y- \theta \Vert ^2\), is a quadratic loss which measures the adherence to experimental data. The second one, \(\Vert \theta \Vert ^2\), is a regularizer which shrinks the estimate towards the origin by penalizing the energy of the solution. The role of the regularization parameter \(\gamma \) is then to balance these two components via a simple scalar adjustment. Equation (1.4) shows that James–Stein’s strategy is to set its value to the inverse of an estimate of the signal-to-noise ratio.
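The equivalence between the two forms of \(\hat{\theta }^{JS}\) in (1.3)–(1.4) can also be verified numerically; a minimal sketch (the data vector is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 10, 1.0
Y = 3.0 + rng.standard_normal(N)                            # an arbitrary data vector

theta_js = Y - (N - 2) * sigma**2 / np.sum(Y**2) * Y        # shrinkage form of James-Stein

gamma = (N - 2) * sigma**2 / (np.sum(Y**2) - (N - 2) * sigma**2)   # Eq. (1.4)
theta_reg = Y / (1 + gamma)                                  # regularized least squares form (1.3)

print(np.allclose(theta_js, theta_reg))                      # True: the two forms coincide
```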

1.1.2 Extensions of the James–Stein Estimator \(\star \)

We have seen that the James–Stein estimator corrects each component of \(\hat{\theta }^{LS}\) by shifting it towards the origin. This implies that the MSE improvement is larger when the components of \(\theta \) are close to zero. Actually, there is nothing special about the origin. If the true \(\theta \) is expected to be close to \(a \in {\mathbb R}^N\), one can modify the original \(\hat{\theta }^{JS}\) as follows:

$$ \hat{\theta }^{JS} = Y - \frac{(N-2)\sigma ^2}{\Vert Y - a \Vert ^2} \left( Y - a\right) . $$

The result is an estimator which still dominates least squares, with the origin’s role now played by a. The estimator thus concentrates the MSE improvement around a.
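In code, this extension only requires shrinking towards a instead of towards the origin. The sketch below (an illustrative Monte Carlo run; the centre a, the spread of \(\theta \) around it and the seed are arbitrary choices) shows how the MSE improvement concentrates around a:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, runs = 10, 1.0, 100_000
a = 5.0 * np.ones(N)                          # prior guess for the location of theta
theta = a + 0.3 * rng.standard_normal(N)      # true means close to a
Y = theta + sigma * rng.standard_normal((runs, N))

def js(Y, centre):
    """James-Stein shrinkage of Y towards the given centre."""
    d = Y - centre
    return Y - (N - 2) * sigma**2 / np.sum(d**2, axis=1, keepdims=True) * d

for name, est in [("LS", Y), ("JS towards 0", js(Y, 0.0)), ("JS towards a", js(Y, a))]:
    print(f"{name:12s}  MSE ~ {np.mean(np.sum((est - theta) ** 2, axis=1)):.2f}")
```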

Now, let us consider a non-orthonormal scenario where Gaussian linear regression amounts to estimating \(\theta \) from the N measurements

$$ y_i = d_i \theta _i + e_i, \quad e_i \sim \mathscr {N}(0,1), $$

with all the noises \(e_i\) mutually independent. The least squares (maximum likelihood) estimator is now

$$ \hat{\theta }^{LS}_i = \frac{y_i}{d_i}, \quad i=1,\ldots ,N, $$

and its MSE is the sum of the variances of \(\hat{\theta }^{LS}_i\), i.e.,

$$ MSE_{LS} = \sum _{i=1}^N \frac{1}{d_i^2}. $$

Note that the MSE can be large when just one of the \(d_i\) is small. In this case, the problem is said to be ill-conditioned: even a moderate measurement error can lead to a large reconstruction error.

Also in this non-orthonormal scenario, it is possible to design estimators whose MSE is uniformly smaller than \(MSE_{LS}\). The number of possible choices is huge, depending on which region of the parameter space one wants to concentrate the improvement on. There is, however, an important limitation shared by all Stein-type estimators: in general, they are not very effective against ill-conditioning. The following example illustrates an estimator whose negative features are representative of the drawbacks of Stein-type estimation in non-orthogonal settings.

Example 1.2

(A generalization of James–Stein) Consider the estimator \(\hat{\theta }\) whose ith component is given by

$$\begin{aligned} \hat{\theta }_i = \left[ 1 - \frac{N-2}{S}d_i^2 \right] \frac{y_i}{d_i}, \quad i=1,\ldots ,N, \end{aligned}$$
(1.5)

where

$$ S = \sum _{i=1}^N d_i^2 y_i^2. $$

It is now shown that \(\hat{\theta }\) is a generalization of James–Stein able to outperform least squares over the entire parameter space. In fact, defining

$$ h_i(Y) = -d_i^2 \frac{N-2}{S}y_i, $$

after simple computations we obtain

$$\begin{aligned} MSE_{\hat{\theta }}&= \sum _{i=1}^N \frac{1}{d_i^2} + \mathscr {E} \left[ 2 \sum _{i=1}^N \frac{(y_i -d_i \theta _i)h_i(Y)}{d_i^2} + \sum _{i=1}^N \frac{h_i^2(Y)}{d_i^2} \right] \\&= \sum _{i=1}^N \frac{1}{d_i^2} + \mathscr {E} \left[ 2 \sum _{i=1}^N \frac{1}{d_i^2} \frac{\partial h_i (Y)}{\partial y_i} + \sum _{i=1}^N \frac{h_i^2(Y)}{d_i^2} \right] , \end{aligned}$$

where the last equality comes from Lemma 1.1 reported in Sect. 1.4. Since

$$ \frac{\partial h_i (Y)}{\partial y_i} = -\frac{(N-2)d_i^2}{S} + \frac{2(N-2)d_i^4 y_i^2}{S^2}, $$

one has

$$\begin{aligned}&\mathscr {E} \left[ 2 \sum _{i=1}^N \frac{1}{d_i^2} \frac{\partial h_i (Y)}{\partial y_i} + \sum _{i=1}^N \frac{h_i^2(Y)}{d_i^2} \right] \\&\qquad = \mathscr {E} \left[ - \frac{2(N-2)N}{S} + \frac{2(N-2)}{S^2} \sum _{i=1}^N 2d_i^2 y_i^2 + \frac{(N-2)^2}{S^2} \sum _{i=1}^N d_i^2 y_i^2 \right] \\&\qquad = - \mathscr {E} \frac{(N-2)^2}{S} < 0 \end{aligned}$$

which implies

$$ MSE_{\hat{\theta }} < MSE_{\hat{\theta }^{LS}} \ \ \forall \theta . $$

However, assume that the problem is ill-conditioned. Then, if one \(d_i\) is small and the values of the \(d_i\) are quite spread out, we could well have \(d_i^2 /S \approx 0\). Hence, (1.5) essentially reduces to

$$ \hat{\theta }_i = \left[ 1 - \frac{N-2}{S}d_i^2 \right] \frac{y_i}{d_i} \approx \frac{y_i}{d_i}, $$

which is the least squares estimate of \(\theta _i\). This means that the signal components mostly influenced by the noise, i.e., those associated with small \(d_i\), are not regularized. Thus, in the presence of ill-conditioning, \(\hat{\theta }\) will likely return an estimate affected by large errors.    \(\square \)
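The effect can be observed numerically. In the sketch below (an illustrative example; the particular values of the \(d_i\) are arbitrary), estimator (1.5) is applied to an ill-conditioned problem: the component associated with the small \(d_i\) is left essentially unregularized and inherits the large error of least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
d = np.array([0.05, 1.0, 2.0, 3.0, 4.0])        # one small d_i: ill-conditioned problem
N = d.size
theta = np.ones(N)
y = d * theta + rng.standard_normal(N)           # y_i = d_i*theta_i + e_i, e_i ~ N(0,1)

S = np.sum(d**2 * y**2)
shrink = 1 - (N - 2) * d**2 / S                  # component-wise factor in (1.5)
theta_hat = shrink * y / d

print("shrinkage factors :", np.round(shrink, 3))     # ~1 for the small d_i: no regularization
print("least squares     :", np.round(y / d, 2))
print("estimate (1.5)    :", np.round(theta_hat, 2))
```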

1.2 Ridge Regression

Consider now one of the fundamental problems in system identification. The task is to estimate the impulse response \(g^0\) of a discrete-time, linear and causal dynamic system, starting from noisy output data. The measurement model is

$$\begin{aligned} y(t) = \sum _{k=1}^{\infty } g_k^0 u(t-k) + e(t), \ \ t=1,\ldots ,N, \end{aligned}$$
(1.6)

where t denotes time, the sampling interval is one time unit for simplicity, the \(g_k^0\) are the impulse response coefficients, u(t) is the known system input and e(t) is the noise.

To determine the impulse response from input–output measurements, one of the main questions is how to parametrize the unknown \(g^0\). The classical approach, which will also be reviewed in the next chapter, introduces a collection of impulse response models \(g(\theta )\), each parametrized by a different vector \(\theta \). In particular, here we adopt an FIR model of order m, i.e., \(g_k(\theta )=\theta _k\) for \(k=1, \ldots ,m\) and zero elsewhere. This permits reformulating (1.6) as a linear regression: stacking the elements y(t) and e(t) into the vectors Y and E, we obtain the model

$$ Y= \varPhi \theta + E $$

with the regression matrix \(\varPhi \in {\mathbb R}^{N \times m}\) given by

$$\small \varPhi = \left( \begin{array}{ccccc} u(0) & u(-1) & u(-2) & \ldots & u(-m+1) \\ u(1) & u(0) & u(-1) & \ldots & u(-m+2) \\ \vdots & \vdots & \vdots & & \vdots \\ u(N-1) & u(N-2) & u(N-3) & \ldots & u(N-m) \end{array}\right) . $$

We can now use least squares to estimate \(\theta \). Assuming that \(\varPhi ^T\varPhi \) has full rank, we obtain

$$\begin{aligned} \hat{\theta }^{LS}&={{\,\mathrm{arg\,min}\,}}_\theta \Vert Y-\varPhi \theta \Vert ^2\end{aligned}$$
(1.7a)
$$\begin{aligned}&=(\varPhi ^T\varPhi )^{-1}\varPhi ^TY. \end{aligned}$$
(1.7b)

Note that the impulse response estimate is a function of the FIR order, which corresponds to the dimension m of \(\theta \). The choice of m is a trade-off between bias (a large m is needed to describe slowly decaying impulse responses without too much error) and variance (a large m requires estimating many parameters, leading to a large variance). This can be illustrated with a numerical experiment. The unknown impulse response \(g^0\) is defined by the following rational transfer function:

$$ \frac{(z+1)^2}{z(z-0.8)(z-0.7)}, $$

which, in practice, is equal to zero after less than 50 samples (\(g^0\) is the red line in Fig. 1.3). We estimate the system from 1000 outputs corrupted by white Gaussian noise e(t) with variance equal to the variance of the noiseless output divided by 50, see Fig. 1.2 (bottom panel). Data come from the system initially at rest and then fed at \(t=0\) with white noise low-pass filtered by \(z/(z-0.99)\), see Fig. 1.2 (top panel). The reconstruction error is very large if we try to estimate \(g^0\) with \(m=50\): linear models are easy to estimate, but the drawback is that high-order FIR models may suffer from high variance. Hence, it is important to select a model order which well balances bias and variance. To do that, one needs to try different values of m and then use some validation procedure to determine the “optimal” one. In this case, since the true \(g^0\) is known, we can obtain the best value by selecting the \(m \in [1,\ldots ,50]\) which minimizes the MSE. This is an example of an oracle-based procedure not implementable in practice: the optimal order is selected exploiting knowledge of the true system. We obtain \(m=18\), which corresponds to \(MSE_{LS}=70.7\) and leads to the impulse response estimate displayed in Fig. 1.3. Even if the data set size is large and the signal-to-noise ratio is good, the estimate is far from satisfactory. The reason is that the low-pass input has poor excitation and leads to an ill-conditioned problem. This means that the condition number of the regression matrix \(\varPhi \) is large, so that even a small output error can produce a large reconstruction error.
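A sketch of this experiment is reported below (using scipy for filtering; the seed is arbitrary, \(g^0\) is truncated after 50 samples and the error is computed on a single data realization, so the numbers will not exactly reproduce \(m=18\) and \(MSE_{LS}=70.7\)):

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
N, n_g = 1000, 50                                # data size, impulse response length considered

# true impulse response of (z+1)^2 / (z(z-0.8)(z-0.7)), truncated after n_g samples
g0 = lfilter([0, 1, 2, 1], [1, -1.5, 0.56], np.r_[1.0, np.zeros(n_g - 1)])

# input: white noise low-pass filtered by z/(z-0.99); the system is at rest before t=0
u = lfilter([1.0], [1.0, -0.99], rng.standard_normal(N))

# noiseless output, then measurements with noise variance = var(noiseless output)/50
Phi_full = toeplitz(u, np.r_[u[0], np.zeros(n_g - 1)])      # N x 50 regression matrix
y_clean = Phi_full @ g0
sigma2 = np.var(y_clean) / 50
Y = y_clean + np.sqrt(sigma2) * rng.standard_normal(N)

# least squares FIR estimates for m = 1,...,50; "oracle" order minimizing the true error
best_err, best_m = np.inf, None
for m in range(1, n_g + 1):
    theta_ls, *_ = np.linalg.lstsq(Phi_full[:, :m], Y, rcond=None)
    err = np.sum((np.r_[theta_ls, np.zeros(n_g - m)] - g0) ** 2)
    if err < best_err:
        best_err, best_m = err, m

print(f"oracle FIR order m = {best_m},  squared error = {best_err:.1f}")
```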

Fig. 1.2 Input–output data

Fig. 1.3 True impulse response \(g^0\) (thick red line) and least squares estimate

An alternative to the classical paradigm, where different model structures are introduced, is the following straightforward generalization of (1.3), known as ridge regression [13, 14]:

$$\begin{aligned} \hat{\theta }^{R}&={{\,\mathrm{arg\,min}\,}}_\theta \Vert Y-\varPhi \theta \Vert ^2 + \gamma \Vert \theta \Vert ^2\end{aligned}$$
(1.8a)
$$\begin{aligned}&=(\varPhi ^T\varPhi +\gamma I_{m})^{-1}\varPhi ^TY, \end{aligned}$$
(1.8b)

where we set \(m=50\) to solve our problem. Letting \(A=(\varPhi ^T\varPhi +\gamma I_{m})^{-1}\varPhi ^T\), it is easy to derive the MSE decomposition associated with \(\hat{\theta }^{R}\):

$$\begin{aligned} MSE_{R} = \underbrace{\sigma ^2 \text{ Trace } (AA^T)}_{Variance} + \underbrace{\Vert \theta - A \varPhi \theta \Vert ^2}_{Bias^2}. \end{aligned}$$
(1.9)

Figure 1.4 displays \(MSE_{R}\) for the particular system identification problem at hand as a function of the regularization parameter. Note that \(\gamma \) plays the role of the model order in the classical scenario but can be tuned in a continuous manner to reach a good bias-variance trade-off. It is also interesting to see its influence on the variance and bias components. The variance is a decreasing function of the regularization parameter. Hence, its maximum is reached for \(\gamma =0\), where \(\hat{\theta }^{R}\) reduces to the least squares estimator \(\hat{\theta }^{LS}\) given by (1.7) with \(m=50\). Instead, the bias increases with \(\gamma \). In the limit \(\gamma \rightarrow \infty \), the penalty \(\Vert \theta \Vert ^2\) is so overweighted that \(\hat{\theta }^{R}\) becomes the constant estimator centred at the origin (it returns all null impulse response coefficients).

In Fig. 1.5, we finally display the ridge regularized estimate with \(\gamma \) set to the value minimizing the error and leading to \(MSE_{R}=16.8\). It is evident that ridge regression provides a much better bias-variance trade-off than selecting the FIR order.
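The corresponding ridge experiment can be sketched as follows (same data-generating mechanism as in the previous sketch, with an arbitrary seed and a grid search for the oracle \(\gamma \); again the numbers are only indicative and will differ from \(MSE_{R}=16.8\)). The last lines also evaluate the decomposition (1.9) at the selected \(\gamma \):

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
N, m = 1000, 50

# same data-generating mechanism as in the least squares sketch above
g0 = lfilter([0, 1, 2, 1], [1, -1.5, 0.56], np.r_[1.0, np.zeros(m - 1)])
u = lfilter([1.0], [1.0, -0.99], rng.standard_normal(N))
Phi = toeplitz(u, np.r_[u[0], np.zeros(m - 1)])
y_clean = Phi @ g0
sigma2 = np.var(y_clean) / 50
Y = y_clean + np.sqrt(sigma2) * rng.standard_normal(N)

# ridge estimates (1.8b) over a grid of regularization parameters
PtP, PtY = Phi.T @ Phi, Phi.T @ Y
best_err, best_gamma = np.inf, None
for gamma in np.logspace(-2, 6, 200):
    theta_r = np.linalg.solve(PtP + gamma * np.eye(m), PtY)
    err = np.sum((theta_r - g0) ** 2)
    if err < best_err:
        best_err, best_gamma = err, gamma

theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)           # gamma = 0: least squares, m = 50
print(f"LS (m=50) squared error : {np.sum((theta_ls - g0) ** 2):.1f}")
print(f"ridge squared error     : {best_err:.1f}  at gamma = {best_gamma:.3g}")

# bias-variance decomposition (1.9) at the selected gamma
A = np.linalg.solve(PtP + best_gamma * np.eye(m), Phi.T)
print(f"variance = {sigma2 * np.trace(A @ A.T):.1f},  squared bias = {np.sum((g0 - A @ Phi @ g0) ** 2):.1f}")
```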

Fig. 1.4 MSE of ridge regression and its bias-variance decomposition as a function of the regularization parameter

Fig. 1.5 True impulse response \(g^0\) (thick red line) and ridge regularized estimate

1.3 Further Topics and Advanced Reading

Stein’s intuition on the development of an estimator able to dominate least squares in terms of global MSE can be found in [23], while the specific shape of \(\hat{\theta }^{JS}\) was obtained in [15]. Since then, a large variety of estimators outperforming least squares, also under different losses, have been designed. It has been proved that there exist estimators which dominate James–Stein, even if the MSE improvement is not large, as described in [12, 16, 25]. Extensions and applications can be found in [5, 6, 11, 22, 24, 26]. A James–Stein version of the Kalman filter is derived in [18]. For interesting discussions on the limitations of Stein-type estimators in facing ill-conditioning see [8], but also [19] for new outcomes with better numerical stability properties. Other developments are reported in [7], where generalizations of Stein’s lemma are also described.

The paper [10] describes connections between James–Stein estimation and the so-called empirical Bayes approaches which will be treated later on in this book. The interplay between Stein-type estimators and the Bayes approach is also discussed in [2], where one can also find an estimator which dominates least squares, concentrating the MSE improvement in an ellipsoid in the parameter space that can be chosen by the user. This approach is deeply connected with robust Bayesian estimation concepts, e.g., see [1, 3].

The term ridge regression was popularized by the works [13, 14]. This approach, introduced to guard against ill-conditioning and numerical instability, is an example of Tikhonov regularization for ill-posed problems. Among the first classical works on regularization and inverse problems, it is worth citing [4, 20, 27, 28, 29]. A recent survey on the use of regularization for system identification can instead be found in [21]. The literature on this topic is huge and other relevant works will be cited in the next chapters.

1.4 Appendix: Proof of Theorem 1.1

To discuss the properties of the James–Stein estimator, first it is useful to introduce a result which is a simplified version of Lemma 3.2 reported in Chap. 3, known as Stein’s lemma.

Lemma 1.1

(Stein’s lemma, based on [24]) Consider N Gaussian and independent random variables \(y_i \sim \mathscr {N}(\theta _i,\sigma ^2)\). Let also \(h:{\mathbb R}^N \rightarrow {\mathbb R}\) be a differentiable function such that \(\mathscr {E} \left| \frac{\partial h (Y) }{\partial y_i} \right| < \infty \) for \(i=1,\ldots ,N\). Then, it holds that

$$ \mathscr {E} ( y_i - \theta _i ) h(Y) = \sigma ^2 \mathscr {E} \frac{\partial h (Y)}{\partial y_i}. $$

Proof

During the proof, we use \(\mathscr {E}_{j \ne i}\) to denote the expectation conditional on \(\{y_j\}_{ j \ne i}\). Also, with a slight abuse of notation, h(x) with \(x \in {\mathbb R}\) indicates the function h with its ith argument set to x and the other arguments set to \(y_j, \ j \ne i\).

Note that, in view of the independence assumptions, each \(y_i\) conditional on \(\{y_j\}_{ j \ne i}\) is still Gaussian with mean \(\theta _i\) and variance \(\sigma ^2\). Then, using integration by parts, one has

$$\begin{aligned} \mathscr {E}_{j \ne i} \left( \frac{\partial h (Y) }{\partial y_i} \right)&= \int _{-\infty }^{+\infty } \frac{\partial h(x)}{\partial x} \frac{\exp (-(x-\theta _i)^2/(2\sigma ^2))}{\sqrt{2 \pi }\sigma } dx\\&= \left[ h(x) \frac{\exp (-(x-\theta _i)^2/(2\sigma ^2))}{\sqrt{2 \pi } \sigma } \right] _{-\infty }^{+\infty } + \int _{-\infty }^{+\infty } \frac{(x-\theta _i)}{\sigma ^2} h(x) \frac{\exp (-(x-\theta _i)^2/(2\sigma ^2))}{\sqrt{2 \pi } \sigma } dx \\&= \int _{-\infty }^{+\infty } \frac{(x-\theta _i)}{\sigma ^2} h(x) \frac{\exp (-(x-\theta _i)^2/(2\sigma ^2))}{\sqrt{2 \pi }\sigma } dx \\&= \frac{\mathscr {E}_{j \ne i} \left( (y_i-\theta _i) h(Y) \right) }{\sigma ^2}. \end{aligned}$$

Note that the penultimate equality exploits the fact that \(h(x) \exp (-(x-\theta _i)^2/(2\sigma ^2))\) must vanish as \(x \rightarrow \pm \infty \), otherwise the assumption \(\mathscr {E} \left| \frac{\partial h (Y) }{\partial y_i} \right| < \infty \) would not hold. Using the above result, we obtain

$$\begin{aligned} \mathscr {E} \left( ( y_i - \theta _i) h( Y) \right)&= \mathscr {E} \left[ \mathscr {E}_{j \ne i} \left( ( y_i - \theta _i) h( Y ) \right) \right] \\&= \sigma ^2 \mathscr {E} \left[ \mathscr {E}_{j \ne i} \left( \frac{\partial h (Y) }{\partial y_i} \right) \right] \\&= \sigma ^2 \mathscr {E} \frac{\partial h (Y) }{\partial y_i} \end{aligned}$$

and this completes the proof.    \(\square \)
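Stein’s lemma can also be checked by simulation. The sketch below (with arbitrary \(\theta \), \(\sigma \) and seed) uses \(h(Y) = y_1/\Vert Y \Vert ^2\), the function employed in the proof of Theorem 1.1, and compares the two sides of the identity by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, runs = 5, 1.5, 1_000_000
theta = np.array([1.0, -0.5, 2.0, 0.0, 0.3])
Y = theta + sigma * rng.standard_normal((runs, N))

# h(Y) = y_1 / ||Y||^2, the function used (for each i) in the proof of Theorem 1.1
norm2 = np.sum(Y**2, axis=1)
h = Y[:, 0] / norm2
dh = 1 / norm2 - 2 * Y[:, 0] ** 2 / norm2**2      # partial derivative of h w.r.t. y_1

lhs = np.mean((Y[:, 0] - theta[0]) * h)           # E[(y_1 - theta_1) h(Y)]
rhs = sigma**2 * np.mean(dh)                      # sigma^2 E[dh/dy_1]
print(f"E[(y_1 - theta_1) h(Y)] ~ {lhs:.5f},  sigma^2 E[dh/dy_1] ~ {rhs:.5f}")
```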

We now show that the MSE of the James–Stein estimator is uniformly smaller than the MSE of least squares. One has

$$\begin{aligned} MSE_{JS}&= \mathscr {E} \left( \Vert \theta - \hat{\theta }^{JS}(Y) \Vert ^2 \right) \\&= \mathscr {E} \left( \Vert \theta - Y + \frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} Y \Vert ^2 \right) \\&= \mathscr {E} \left( \Vert \theta - Y \Vert ^2 +\frac{(N-2)^2\sigma ^4}{\Vert Y \Vert ^4} \Vert Y \Vert ^2 + 2(\theta - Y)^T Y\frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} \right) \\&= N\sigma ^2 + \mathscr {E} \left( \frac{(N-2)^2\sigma ^4}{\Vert Y \Vert ^2} + 2(\theta - Y)^T Y\frac{(N-2)\sigma ^2}{\Vert Y \Vert ^2} \right) . \end{aligned}$$

As for the last term inside the expectation, exploiting Stein’s lemma with

$$ h_i(Y ) = \frac{y_i}{ \Vert Y \Vert ^2}, \quad \frac{\partial h_i (Y) }{\partial y_i} = \frac{1}{\Vert Y \Vert ^2} - 2 \frac{y_i^2}{\Vert Y \Vert ^4}, $$

one has

$$\begin{aligned} \mathscr {E} \left( \frac{(\theta - Y)^T Y }{\Vert Y \Vert ^2} \right)&= \mathscr {E} \left( \sum _{i=1}^N (\theta _i - y_i) h_i(Y ) \right) \\&= - \sigma ^2 \mathscr {E} \left( \sum _{i=1}^N \left( \frac{1}{\Vert Y \Vert ^2} - 2 \frac{y_i^2}{\Vert Y \Vert ^4} \right) \right) \\&= - \sigma ^2 \mathscr {E} \left( \frac{N-2}{\Vert Y \Vert ^2} \right) . \end{aligned}$$

Using this equality in the MSE expression, we finally obtain

$$\begin{aligned} MSE_{JS}&= N\sigma ^2 + \mathscr {E} \left( \frac{(N-2)^2\sigma ^4}{\Vert Y \Vert ^2} -2 \frac{(N-2)^2\sigma ^4}{\Vert Y \Vert ^2} \right) \\&= N\sigma ^2 - (N-2)^2\sigma ^4 \mathscr {E} \left( \frac{1}{\Vert Y \Vert ^2} \right) < N\sigma ^2. \end{aligned}$$