Best linear estimation via minimization of relative mean squared error


We propose methods to construct a biased linear estimator for linear regression that optimizes the relative mean squared error (MSE). Although biased estimators with smaller MSE than the ordinary least squares estimator have been proposed before, our construction is based on minimizing the relative MSE directly. The performance of the proposed methods is illustrated through a simulation study and a real data example. The results show that our methods can improve on MSE, particularly when the predictors are correlated.





Author information



Corresponding author

Correspondence to Howard D. Bondell.


Appendix A Proof of Theorem 1

Let \({{{\varvec{x}}}}_i\) and \({{{\varvec{m}}}}_i\) denote the ith row of \({{\varvec{X}}}\) and \({{\varvec{M}}}\), respectively. It is clear that when \(\tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}\le c'\), \(\mathrm{{Tr}}(\varvec{MM^T})\) can reach its unrestricted minimum 0 at \(\hat{{{\varvec{M}}}}\,=\,\varvec{0}\). So as long as \(\tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}\le c'\), \(\hat{{{\varvec{M}}}}\,=\,\varvec{0}\).

If \(c'=0\), the problem becomes to minimize \({{{\varvec{m}}}}_i^T{{{\varvec{m}}}}_i\) subject to \({{\varvec{b}}}^T{{{\varvec{m}}}}_i=\tilde{\beta }_i\) for each \(i=1, 2,\ldots ,p\), whose solution is \({{{\varvec{m}}}}_i = \frac{\tilde{\beta }_i}{{{\varvec{b}}}^T{{\varvec{b}}}}{{\varvec{b}}}\) by the property of the Moore–Penrose pseudoinverse. Therefore, \(\hat{{{\varvec{M}}}}\,=\,({{\varvec{b}}}^T{{\varvec{b}}})^{-1} {{\varvec{b}}}^T\otimes \tilde{\varvec{\beta }}\).
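As a quick numerical sanity check (not from the paper; the simulated \({{\varvec{X}}}\) and \(\tilde{\varvec{\beta }}\) are illustrative), the \(c'=0\) closed form agrees with the minimum-norm solution produced by numpy's Moore–Penrose pseudoinverse:

```python
# Sketch: for c' = 0 each row of M-hat is the minimum-norm solution of the
# single linear constraint b^T m_i = beta_i (simulated data, for illustration).
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 5
X = rng.standard_normal((n, p))
beta_tilde = rng.standard_normal(p)          # stand-in for the estimator beta-tilde
b = X @ beta_tilde                           # b = X beta-tilde, an n-vector

# Closed form from the proof: m_i = (beta_i / b^T b) * b
M_hat = np.outer(beta_tilde, b) / (b @ b)    # p x n, row i = beta_i b / (b^T b)

# Same rows via the Moore-Penrose pseudoinverse of the 1 x n matrix b^T
M_pinv = np.outer(beta_tilde, np.linalg.pinv(b[None, :]).ravel())

assert np.allclose(M_hat, M_pinv)
assert np.allclose(M_hat @ b, beta_tilde)    # constraint b^T m_i = beta_i holds
```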

Now consider the situation that \(\tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}> c'>0\). Note that increasing the upper bound of the constraint enlarges the feasible region, so the minimum value of the objective function is non-increasing. Let \(c''\) denote the value of the constraint at the minimizer for a given bound \(c'\). It follows that \(c''\le c'\) and that the objective function is constant for any choice of bound between \(c''\) and \(c'\). Therefore, without loss of generality, assume that the solution is obtained on the boundary. The optimization problem described in (6) is then equivalent to minimizing \(L({{\varvec{M}}},\lambda )\), defined as:

$$\begin{aligned} L({{\varvec{M}}},\lambda )=\sum _{i}^{p}{{{\varvec{m}}}}_i^T {{{\varvec{m}}}}_i+\lambda \left[ \sum _{i=1}^{p}({{\varvec{b}}}^T {{{\varvec{m}}}}_i-\tilde{\beta }_i)^2-c'\right] . \end{aligned}$$

Taking the derivative of \(L({{\varvec{M}}},\lambda )\) with respect to each \({{{\varvec{m}}}}_i\) (\(i=1, 2, \ldots , p\)) and setting them to 0, we have

$$\begin{aligned} {{{\varvec{m}}}}_i=-\lambda ({{\varvec{b}}}^T{{{\varvec{m}}}}_i -\tilde{\beta }_i){{\varvec{b}}}, \qquad i=1,2,\ldots ,p. \end{aligned}$$

From (9) and the constraint \(\sum _{i=1}^{p}({{\varvec{b}}}^T{{\varvec{m}}}_i-\tilde{\beta }_i)^2=c'\), we get \(\sum _{i=1}^{p}{{\varvec{m}}}_i^T{{\varvec{m}}}_i =\lambda ^2\sum _{i=1}^{p}({{\varvec{b}}}^T{{\varvec{m}}}_i -\tilde{\beta }_i)^2{{\varvec{b}}}^T{{\varvec{b}}} =\lambda ^2c'{{\varvec{b}}}^T{{\varvec{b}}}\). This implies that \(\lambda \) cannot be a constant independent of \(c'\); otherwise the minimum value of the objective function, \(\sum _{i=1}^{p}{{\varvec{m}}}_i^T{{\varvec{m}}}_i\), would be a strictly increasing function of \(c'\). Since \(\lambda \) is not such a constant, in particular \(\lambda {{\varvec{b}}}^T{{\varvec{b}}} + 1 \ne 0\). Multiplying both sides of (9) by \({{\varvec{b}}}^T\) and rearranging terms, we obtain

$$\begin{aligned} {{\varvec{b}}}^T{{\varvec{m}}}_i=\frac{\lambda {{\varvec{b}}}^T{{\varvec{b}}}\tilde{\beta }_i}{\lambda {{\varvec{b}}}^T {{\varvec{b}}}\,+\,1}. \end{aligned}$$

Plugging (10) into the constraint \(\sum _{i=1}^{p}({{\varvec{b}}}^T{{\varvec{m}}}_i-\tilde{\beta }_i)^2=c'\), we get

$$\begin{aligned} \lambda =\frac{-1\pm \sqrt{\sum _{i=1}^{p}\tilde{\beta }_i^2/c'}}{{{\varvec{b}}}^T{{\varvec{b}}}}. \end{aligned}$$

From (9) and (10), we have

$$\begin{aligned} {{\varvec{m}}}_i^T{{\varvec{m}}}_i&=\,\frac{\lambda ^2{{\varvec{b}}}^T{{\varvec{b}}}\tilde{\beta }_i^2}{(\lambda {{\varvec{b}}}^T{{\varvec{b}}}\,+\,1)^2}. \end{aligned}$$

For either choice of \(\lambda \), the denominator of (12) equals \(\sum _{i=1}^{p}\tilde{\beta }_i^2/c'\), so \(\sum _{i=1}^{p}{{\varvec{m}}}_i^T{{\varvec{m}}}_i=\lambda ^2c'{{\varvec{b}}}^T{{\varvec{b}}}\). We know that \(\sum _{i=1}^{p}{{\varvec{m}}}_i^T{{\varvec{m}}}_i\) must be non-increasing in \(c'\); hence \(\lambda ^2\) must be non-increasing in \(c'\). Combining (11) and (12), we get \(\mathrm{{Tr}}(\varvec{MM}^T)=\sum _{i=1}^{p}{{\varvec{m}}}_i^T {{\varvec{m}}}_i=\frac{c'(-1\pm \sqrt{\sum _{i=1}^{p} \tilde{\beta }_i^2/c'})^2}{{{\varvec{b}}}^T{{\varvec{b}}}}\). It can then be shown directly that \(\lambda =\frac{-1+\sqrt{\sum _{i=1}^{p}\tilde{\beta }_i^2/c'}}{{{\varvec{b}}}^T{{\varvec{b}}}}\) is the correct choice, as it is the only one that makes \(\mathrm{{Tr}}(\varvec{MM}^T)\) a strictly decreasing function of \(c'\) for \(0<c'<\tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}\). This strict monotonicity also proves that the optimal value is indeed attained on the boundary of the constraint for any \(0<c'<\tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}\). Based on (9), \(\hat{{{\varvec{m}}}}_i=(1/\lambda {{\varvec{I}}}_n \,+\,{{\varvec{b}}}{{\varvec{b}}}^T)^{-1}{{\varvec{b}}}\tilde{\beta }_i\) for all \(i=1, 2, \ldots , p\), which is equivalent to solving a ridge regression with single response \(\tilde{\beta }_i\), covariate matrix \({{\varvec{b}}}^T\) and tuning parameter \(1/\lambda \). This further shows that the solution for \({{\varvec{M}}}\) is unique.
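The boundary solution can likewise be checked numerically. The sketch below (simulated data, for illustration only) builds the ridge-form rows \(\hat{{{\varvec{m}}}}_i=(1/\lambda {{\varvec{I}}}_n+{{\varvec{b}}}{{\varvec{b}}}^T)^{-1}{{\varvec{b}}}\tilde{\beta }_i\) for the chosen root of \(\lambda \) and verifies that they satisfy the stationarity condition (9) and meet the constraint with equality:

```python
# Sketch: verify the boundary solution of Theorem 1 on simulated data.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 5
X = rng.standard_normal((n, p))
beta_tilde = rng.standard_normal(p)
b = X @ beta_tilde
bb = b @ b
S = beta_tilde @ beta_tilde                  # beta-tilde^T beta-tilde

c_prime = 0.3 * S                            # any value in (0, beta^T beta)
lam = (-1.0 + np.sqrt(S / c_prime)) / bb     # the root chosen in the proof

# Row i solves a ridge problem with tuning parameter 1/lam:
# m_i = (I_n / lam + b b^T)^{-1} b * beta_i
A = np.eye(n) / lam + np.outer(b, b)
M_hat = np.stack([np.linalg.solve(A, b * bi) for bi in beta_tilde])

# Stationarity condition (9) and the active constraint both hold
resid = M_hat @ b - beta_tilde
assert np.allclose(M_hat, -lam * np.outer(resid, b))
assert np.isclose(resid @ resid, c_prime)
```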

Appendix B Proof of Proposition 1

In order to prove Proposition 1, we require the following two lemmas, which we state here for completeness.

Lemma 1

For any symmetric \(p\times p\) matrix \(\varvec{A}\) and any \(\varvec{x}\in \mathbb {R}^p\) with \(\varvec{x}\ne \varvec{0}\),

$$\begin{aligned} \lambda _p=\underset{\varvec{x}}{{\min }} \frac{\varvec{x}^T\varvec{Ax}}{\varvec{x}^T \varvec{x}}\le \frac{\varvec{x}^T\varvec{Ax}}{\varvec{x}^T\varvec{x}}\le \underset{\varvec{x}}{{\max }}\frac{\varvec{x}^T\varvec{Ax}}{\varvec{x}^T\varvec{x}}\,=\,\lambda _1, \end{aligned}$$

where \(\lambda _1\ge \cdots \ge \lambda _p\) are the ordered eigenvalues of \(\varvec{A}\).
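Lemma 1 is the standard Rayleigh-quotient bound; a minimal numerical illustration (simulated data, for intuition only):

```python
# Sketch: the Rayleigh quotient of a symmetric matrix lies between its
# smallest and largest eigenvalues (Lemma 1), checked on random data.
import numpy as np

rng = np.random.default_rng(2)
p = 6
A = rng.standard_normal((p, p))
A = (A + A.T) / 2                            # make A symmetric
eigvals = np.linalg.eigvalsh(A)              # ascending: lambda_p, ..., lambda_1

x = rng.standard_normal(p)
rayleigh = (x @ A @ x) / (x @ x)
assert eigvals[0] - 1e-9 <= rayleigh <= eigvals[-1] + 1e-9
```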

Lemma 2

(Schur Lemma) The matrix \(\begin{bmatrix} {{{\varvec{A}}}}&{{{\varvec{B}}}}\\ {{{\varvec{C}}}}&{{{\varvec{D}}}} \end{bmatrix}\) is nonnegative definite (n.n.d.) if and only if the Schur complement of \({{{\varvec{D}}}}\), \({{{\varvec{A}}}}-{{{\varvec{B}}}}{{{\varvec{D}}}}^{-1}{{{\varvec{C}}}}\), is n.n.d.
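Since Lemma 2 is applied here in the symmetric case with a positive definite lower-right block, a small numerical check of that case (illustrative only) is straightforward:

```python
# Sketch: for a symmetric block matrix [[A, B], [B^T, D]] with D positive
# definite, the block matrix is n.n.d. iff the Schur complement of D is n.n.d.
import numpy as np

rng = np.random.default_rng(4)
k = 4
tol = 1e-8

# Random symmetric A, random B, and a positive definite D
A0 = rng.standard_normal((k, k)); A = (A0 + A0.T) / 2
B = rng.standard_normal((k, k))
D0 = rng.standard_normal((k, k)); D = D0 @ D0.T + k * np.eye(k)

block = np.block([[A, B], [B.T, D]])
schur = A - B @ np.linalg.solve(D, B.T)      # Schur complement of D

block_nnd = bool(np.all(np.linalg.eigvalsh(block) >= -tol))
schur_nnd = bool(np.all(np.linalg.eigvalsh(schur) >= -tol))
assert block_nnd == schur_nnd                # equivalence holds either way

# A genuinely n.n.d. block matrix: both sides of the equivalence are True
G = rng.standard_normal((2 * k, 2 * k))
Mpsd = G @ G.T
A2, B2, D2 = Mpsd[:k, :k], Mpsd[:k, k:], Mpsd[k:, k:]
schur2 = A2 - B2 @ np.linalg.solve(D2, B2.T)
assert np.all(np.linalg.eigvalsh(schur2) >= -tol)
```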

We now prove Proposition 1. By Lemma 1, \(\underset{\varvec{\beta }\in \mathbb {R}^p}{\text {max}} \frac{\varvec{\beta }^T(\varvec{MX}\,-\,{{\varvec{I}}}_p)^T ({{{\varvec{MX}}}}\,-\,{{\varvec{I}}}_p)\varvec{\beta }}{\varvec{\beta }^T\varvec{\beta }}\) is the largest eigenvalue of the matrix \(({{{\varvec{MX}}}}\,-\,{{\varvec{I}}}_p)^T({{{\varvec{MX}}}}\,-\,{{\varvec{I}}}_p)\), denoted \(\lambda _1\). Hence, problem (7) is equivalent to

The last equivalence is based on Lemma 2.

Appendix C Proof of Theorem 2

Let \({{{\varvec{x}}}}_i^*\) denote the \(i\)th column of \({{\varvec{X}}}\). From Theorem 1, when \(c' \ge \tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}\), it follows that \(\mathrm{{Tr}}({{\varvec{X}}}\hat{{{\varvec{M}}}})=0\), since \(\hat{{{\varvec{M}}}}\,=\,\varvec{0}\). When \(c'=0\), \(\mathrm{{Tr}}({{\varvec{X}}}\hat{{{\varvec{M}}}})=\sum _{i=1}^{p}({{{\varvec{x}}}}_i^*)^T\hat{{{\varvec{m}}}}_i=\Big[\sum _{i=1}^{p}\tilde{\beta }_i({{{\varvec{x}}}}_i^*)^T\Big]\frac{{{\varvec{b}}}}{{{\varvec{b}}}^T{{\varvec{b}}}}=\frac{{{\varvec{b}}}^T{{\varvec{b}}}}{{{\varvec{b}}}^T{{\varvec{b}}}}=1\). When \(0< c' < \tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}\), each row of \(\hat{{{\varvec{M}}}}\) satisfies (9) and (10). Multiplying both sides of (9) by \(({{{\varvec{x}}}}_i^*)^T\) and applying (10), we have

$$\begin{aligned} (\varvec{x}^*_i)^T\hat{{{\varvec{m}}}}_i =\frac{\lambda }{\lambda {{\varvec{b}}}^T{{\varvec{b}}}\,+\,1} \tilde{\beta }_i(\varvec{x}^*_i)^T{{\varvec{b}}}. \end{aligned}$$


Summing over \(i\) then gives

$$\begin{aligned} \mathrm{{Tr}}({{\varvec{X}}}\hat{{{\varvec{M}}}})=\sum _{i=1}^{p} (\varvec{x}_i^*)^T\hat{{{\varvec{m}}}}_i&=\frac{\lambda }{\lambda {{\varvec{b}}}^T{{\varvec{b}}}\,+\,1} \sum _{i=1}^{p}\tilde{\beta }_i(\varvec{x}^*_i)^T{{\varvec{b}}}\\&=\,\frac{\lambda }{\lambda {{\varvec{b}}}^T{{\varvec{b}}}\,+\,1} {{\varvec{b}}}^T{{\varvec{b}}}\\&= \,1-\frac{1}{\lambda {{\varvec{b}}}^T{{\varvec{b}}}\,+\,1}. \end{aligned}$$

Because \({{\varvec{b}}}\,=\,{{\varvec{X}}}\tilde{\varvec{\beta }}\) is nonzero, \({{\varvec{b}}}^T{{\varvec{b}}}\) is positive. We have already shown in the proof of Theorem 1 that \(\lambda =\frac{-1+ \sqrt{\sum _{i=1}^{p}\tilde{\beta }_i^2/c'}}{{{\varvec{b}}}^T\varvec{b}}\), which is a strictly decreasing function of \(c'\). So \(\text {Tr}({{\varvec{X}}}\hat{{{\varvec{M}}}})\) is a strictly decreasing function of \(c'\) when \(0< c' < \tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}\). When \(c'\rightarrow 0\), \(\text {Tr}({{\varvec{X}}}\hat{{{\varvec{M}}}})\rightarrow 1\), and as \(c'\rightarrow \tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}\), \(\text {Tr}({{\varvec{X}}}\hat{{{\varvec{M}}}})\rightarrow 0\). Therefore, the statement in the theorem holds.
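Combining the expressions above, on \(0<c'<\tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}\) the trace simplifies to \(\mathrm{{Tr}}({{\varvec{X}}}\hat{{{\varvec{M}}}})=1-\sqrt{c'/\tilde{\varvec{\beta }}^T\tilde{\varvec{\beta }}}\), which the following sketch (simulated data, for illustration only) confirms numerically:

```python
# Sketch: Tr(X M-hat) decreases strictly from 1 to 0 as c' runs over
# (0, beta-tilde^T beta-tilde), matching 1 - sqrt(c'/S).
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 5
X = rng.standard_normal((n, p))
beta_tilde = rng.standard_normal(p)
b = X @ beta_tilde
bb = b @ b
S = beta_tilde @ beta_tilde

def trace_XM(c_prime):
    # Build M-hat row by row from the ridge form in the proof of Theorem 1
    lam = (-1.0 + np.sqrt(S / c_prime)) / bb
    A = np.eye(n) / lam + np.outer(b, b)
    M_hat = np.stack([np.linalg.solve(A, b * bi) for bi in beta_tilde])
    return np.trace(X @ M_hat)

grid = np.linspace(0.05, 0.95, 10) * S       # c' values inside (0, S)
traces = np.array([trace_XM(c) for c in grid])
assert np.all(np.diff(traces) < 0)           # strictly decreasing in c'
assert np.allclose(traces, 1.0 - np.sqrt(grid / S))
```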


Cite this article

Su, L., Bondell, H.D. Best linear estimation via minimization of relative mean squared error. Stat Comput 29, 33–42 (2019).



Keywords

  • Biased linear estimator
  • Smallest relative mean squared error
  • Ridge regression
  • Ordinary least squares