
Optimal Shrinkage Estimation in Heteroscedastic Hierarchical Linear Models

A chapter in Big and Complex Data Analysis, part of the book series Contributions to Statistics.

Abstract

Shrinkage estimators have profound impacts in statistics and in scientific and engineering applications. In this article, we consider shrinkage estimation in the presence of linear predictors. We formulate two heteroscedastic hierarchical regression models and study optimal shrinkage estimators in each model. A class of shrinkage estimators, both parametric and semiparametric, based on an unbiased risk estimate (URE) is proposed and shown to be (asymptotically) optimal under mean squared error loss in each model. A simulation study is conducted to compare the performance of the proposed methods with existing shrinkage estimators. We also apply the method to real data and obtain encouraging results.


References

  1. Berger, J.O., Strawderman, W.E.: Choice of hierarchical priors: admissibility in estimation of normal means. Ann. Stat. 24(3), 931–951 (1996)

  2. Brown, L.D.: In-season prediction of batting averages: a field test of empirical Bayes and Bayes methodologies. Ann. Appl. Stat. 2(1), 113–152 (2008)

  3. Copas, J.B.: Regression, prediction and shrinkage. J. R. Stat. Soc. Ser. B Methodol. 45(3), 311–354 (1983)

  4. Efron, B., Morris, C.: Empirical Bayes on vector observations: an extension of Stein’s method. Biometrika 59(2), 335–347 (1972)

  5. Efron, B., Morris, C.: Stein’s estimation rule and its competitors—an empirical Bayes approach. J. Am. Stat. Assoc. 68(341), 117–130 (1973)

  6. Efron, B., Morris, C.: Data analysis using Stein’s estimator and its generalizations. J. Am. Stat. Assoc. 70(350), 311–319 (1975)

  7. Fearn, T.: A Bayesian approach to growth curves. Biometrika 62(1), 89–100 (1975)

  8. Green, E.J., Strawderman, W.E.: The use of Bayes/empirical Bayes estimation in individual tree volume equation development. For. Sci. 31(4), 975–990 (1985)

  9. Hui, S.L., Berger, J.O.: Empirical Bayes estimation of rates in longitudinal studies. J. Am. Stat. Assoc. 78(384), 753–760 (1983)

  10. James, W., Stein, C.: Estimation with quadratic loss. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 361–379. University of California Press, Berkeley (1961)

  11. Jiang, J., Nguyen, T., Rao, J.S.: Best predictive small area estimation. J. Am. Stat. Assoc. 106(494), 732–745 (2011)

  12. Jones, K.: Specifying and estimating multi-level models for geographical research. Trans. Inst. Br. Geogr. 16(2), 148–159 (1991)

  13. Li, K.C.: Asymptotic optimality of C_L and generalized cross-validation in ridge regression with application to spline smoothing. Ann. Stat. 14(3), 1101–1112 (1986)

  14. Lindley, D.V.: Discussion of a paper by C. Stein. J. R. Stat. Soc. Ser. B Methodol. 24, 285–287 (1962)

  15. Lindley, D.V., Smith, A.F.M.: Bayes estimates for the linear model. J. R. Stat. Soc. Ser. B Methodol. 34(1), 1–41 (1972)

  16. Morris, C.N.: Parametric empirical Bayes inference: theory and applications. J. Am. Stat. Assoc. 78(381), 47–55 (1983)

  17. Morris, C.N., Lysy, M.: Shrinkage estimation in multilevel normal models. Stat. Sci. 27(1), 115–134 (2012)

  18. Normand, S.L.T., Glickman, M.E., Gatsonis, C.A.: Statistical methods for profiling providers of medical care: issues and applications. J. Am. Stat. Assoc. 92(439), 803–814 (1997)

  19. Oman, S.D.: Shrinking towards subspaces in multiple linear regression. Technometrics 24(4), 307–311 (1982)

  20. Raftery, A.E., Madigan, D., Hoeting, J.A.: Bayesian model averaging for linear regression models. J. Am. Stat. Assoc. 92(437), 179–191 (1997)

  21. Robbins, H.: An empirical Bayes approach to statistics. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. Contributions to the Theory of Statistics, vol. 1, pp. 157–163. University of California Press, Berkeley (1956)

  22. Rubin, D.B.: Using empirical Bayes techniques in the law school validity studies. J. Am. Stat. Assoc. 75(372), 801–816 (1980)

  23. Sclove, S.L., Morris, C., Radhakrishnan, R.: Non-optimality of preliminary-test estimators for the mean of a multivariate normal distribution. Ann. Math. Stat. 43(5), 1481–1490 (1972)

  24. Stein, C.: Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. Contributions to the Theory of Statistics, vol. 1, pp. 197–206. University of California Press, Berkeley (1956)

  25. Stein, C.M.: Confidence sets for the mean of a multivariate normal distribution (with discussion). J. R. Stat. Soc. Ser. B Methodol. 24, 265–296 (1962)

  26. Stein, C.: An approach to the recovery of inter-block information in balanced incomplete block designs. In: David, F.N. (ed.) Research Papers in Statistics: Festschrift for J. Neyman, pp. 351–366. Wiley, London (1966)

  27. Strenio, J.F., Weisberg, H.I., Bryk, A.S.: Empirical Bayes estimation of individual growth-curve parameters and their relationship to covariates. Biometrics 39(1), 71–86 (1983)

  28. Tan, Z.: Steinized empirical Bayes estimation for heteroscedastic data. Stat. Sin. 26, 1219–1248 (2016)

  29. Xie, X., Kou, S.C., Brown, L.D.: SURE estimates for a heteroscedastic hierarchical model. J. Am. Stat. Assoc. 107(500), 1465–1479 (2012)

  30. Xie, X., Kou, S.C., Brown, L.D.: Optimal shrinkage estimation of mean parameters in family of distributions with quadratic variance. Ann. Stat. 44, 564–597 (2016)


Acknowledgements

S. C. Kou’s research is supported in part by US National Science Foundation Grant DMS-1510446.

Author information

Correspondence to S. C. Kou.

Appendix: Proofs and Derivations

Proof of Lemma 1

We can write \(\boldsymbol{\theta }=\boldsymbol{\mu } +\boldsymbol{Z}_{1}\) and \(\boldsymbol{Y } =\boldsymbol{\theta } +\boldsymbol{Z}_{2}\), where \(\boldsymbol{Z}_{1} \sim \mathcal{N}_{p}(\boldsymbol{0},\boldsymbol{B})\) and \(\boldsymbol{Z}_{2} \sim \mathcal{N}_{p}(\boldsymbol{0},\boldsymbol{A})\) are independent. Jointly, \(\binom{\boldsymbol{Y}}{\boldsymbol{\theta}}\) is still multivariate normal with mean vector \(\binom{\boldsymbol{\mu}}{\boldsymbol{\mu}}\) and covariance matrix \(\left (\begin{array}{cc} \boldsymbol{A} +\boldsymbol{ B}&\boldsymbol{B}\\ \boldsymbol{B} &\boldsymbol{B} \end{array} \right )\). The result follows immediately from the conditional distribution of a multivariate normal distribution.
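As a quick numerical sanity check of this argument (a minimal sketch of ours, not part of the chapter; the dimension and covariances below are arbitrary illustrative choices), one can verify on random inputs that the conditional mean \(\boldsymbol{\mu } +\boldsymbol{ B}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\left (\boldsymbol{Y }-\boldsymbol{\mu }\right )\) implied by this joint distribution coincides with the shrinkage form \(\left (\boldsymbol{I}_{p} -\boldsymbol{ A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\right )\boldsymbol{Y } +\boldsymbol{ A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{\mu }\) used in the proof of Theorem 1 below:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5

# Arbitrary illustrative inputs: diagonal A (heteroscedastic), PSD B.
A = np.diag(rng.uniform(0.5, 2.0, size=p))
G = rng.standard_normal((p, p))
B = G @ G.T / p
mu = rng.standard_normal(p)
Y = rng.standard_normal(p)

# Conditional mean from the joint normal of (Y, theta):
# E[theta | Y] = mu + B (A + B)^{-1} (Y - mu).
cond_mean = mu + B @ np.linalg.solve(A + B, Y - mu)

# Equivalent shrinkage form: (I - A(A+B)^{-1}) Y + A(A+B)^{-1} mu,
# which follows from B(A+B)^{-1} = I - A(A+B)^{-1}.
S = A @ np.linalg.inv(A + B)
shrinkage = (np.eye(p) - S) @ Y + S @ mu

assert np.allclose(cond_mean, shrinkage)
```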

Proof of Theorem 1

We start by decomposing the difference between the URE and the actual loss as

$$\displaystyle\begin{array}{rcl} & & \mathrm{URE}\left (\boldsymbol{B},\boldsymbol{\mu }\right ) - l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{\mu }}\right ) \\ & =& \mathrm{URE}\left (\boldsymbol{B},\boldsymbol{0}_{p}\right ) - l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{0}_{p} }\right ) -\dfrac{2} {p}\mathrm{tr}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{\mu }\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\right ){}\end{array}$$
(18)
$$\displaystyle\begin{array}{rcl} & =& \dfrac{1} {p}\mathrm{tr}\left (\boldsymbol{Y Y }^{T} -\boldsymbol{ A} -\boldsymbol{\theta \theta }^{T}\right ) -\dfrac{2} {p}\mathrm{tr}\left (\boldsymbol{B}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\left (\boldsymbol{Y Y }^{T} -\boldsymbol{ Y \theta }^{T} -\boldsymbol{ A}\right )\right ) \\ & & -\dfrac{2} {p}\mathrm{tr}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{\mu }\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\right ) \\ & =& \left (\mathrm{I}\right ) + \left (\mathrm{II}\right ) + \left (\mathrm{III}\right ). {}\end{array}$$
(19)

To verify the first equality (18), note that

$$\displaystyle\begin{array}{rcl} & & \mathrm{URE}\left (\boldsymbol{B},\boldsymbol{\mu }\right ) -\mathrm{ URE}\left (\boldsymbol{B},\boldsymbol{0}_{p}\right ) {}\\ & & = \dfrac{1} {p}\left \Vert \boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\left (\boldsymbol{Y }-\boldsymbol{\mu }\right )\right \Vert ^{2} -\dfrac{1} {p}\left \Vert \boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{Y }\right \Vert ^{2} {}\\ & & = -\dfrac{1} {p}\mathrm{tr}\left (\boldsymbol{\mu }^{T}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\right )^{T}\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\left (2\boldsymbol{Y }-\boldsymbol{\mu }\right )\right ), {}\\ & & \quad l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{\mu }}\right ) - l_{ p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{0}_{p} }\right ) {}\\ & & = \dfrac{1} {p}\left \Vert \left (\boldsymbol{I}_{p} -\boldsymbol{ A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\right )\boldsymbol{Y } +\boldsymbol{ A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{\mu }-\boldsymbol{\theta }\right \Vert ^{2} {}\\ & & \quad -\dfrac{1} {p}\left \Vert \left (\boldsymbol{I}_{p} -\boldsymbol{ A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\right )\boldsymbol{Y }-\boldsymbol{\theta }\right \Vert ^{2} {}\\ & & = \dfrac{1} {p}\mathrm{tr}\left (\boldsymbol{\mu }^{T}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\right )^{T}\left (2\left (\left (\boldsymbol{I}_{ p} -\boldsymbol{ A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\right )\boldsymbol{Y -\theta }\right ) +\boldsymbol{ A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{\mu }\right )\right ). {}\\ \end{array}$$

Equation (18) then follows by rearranging the terms. To verify the second equality (19), note that

$$\displaystyle\begin{array}{rcl} & & \mathrm{URE}\left (\boldsymbol{B},\boldsymbol{0}_{p}\right ) - l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{0}_{p} }\right ) {}\\ & & = \dfrac{1} {p}\left \Vert \boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{Y }\right \Vert ^{2} -\dfrac{1} {p}\left \Vert \left (\boldsymbol{I}_{p} -\boldsymbol{ A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\right )\boldsymbol{Y }-\boldsymbol{\theta }\right \Vert ^{2} {}\\ & & \quad + \dfrac{1} {p}\mathrm{tr}\left (\boldsymbol{A} - 2\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{A}\right ) {}\\ & & = \dfrac{1} {p}\mathrm{tr}\left (\left (\boldsymbol{Y } - 2\left (\boldsymbol{I}_{p} -\boldsymbol{ A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\right )\boldsymbol{Y }+\boldsymbol{\theta }\right )^{T}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\right ) {}\\ & & \quad + \dfrac{1} {p}\mathrm{tr}\left (\boldsymbol{A} - 2\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{A}\right ) {}\\ & & = \dfrac{1} {p}\mathrm{tr}\left (\boldsymbol{Y Y }^{T} -\boldsymbol{ A} -\boldsymbol{\theta \theta }^{T}\right ) -\dfrac{2} {p}\mathrm{tr}\left (\boldsymbol{B}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\left (\boldsymbol{Y }\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T} -\boldsymbol{ A}\right )\right ). {}\\ \end{array}$$

With this decomposition in hand, we want to prove separately the uniform \(L^{1}\) convergence of the three terms \(\left (\mathrm{I}\right )\), \(\left (\mathrm{II}\right )\), and \(\left (\mathrm{III}\right )\).
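The starting point here, that the URE is an unbiased estimate of the loss, is easy to confirm by simulation. Below is a minimal Monte Carlo sketch of ours (not the chapter's code) for Model I with \(\boldsymbol{B} =\lambda \boldsymbol{I}_{p}\) and \(\boldsymbol{\mu } =\boldsymbol{ 0}_{p}\), using \(\mathrm{URE}\left (\boldsymbol{B},\boldsymbol{0}_{p}\right ) = \frac{1} {p}\left \Vert \boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{Y }\right \Vert ^{2} + \frac{1} {p}\mathrm{tr}\left (\boldsymbol{A} - 2\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{A}\right )\) as read off from the display above; all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
p, reps, lam = 200, 2000, 1.0

A = rng.uniform(0.5, 2.0, size=p)   # heteroscedastic variances A_i (illustrative)
theta = rng.standard_normal(p)      # fixed true means
s = A / (A + lam)                   # diagonal of A(A+B)^{-1} for B = lam * I_p

ure, loss = np.empty(reps), np.empty(reps)
for r in range(reps):
    Y = theta + rng.standard_normal(p) * np.sqrt(A)
    theta_hat = (1 - s) * Y         # shrinkage estimate with mu = 0
    ure[r] = np.sum((s * Y) ** 2) / p + np.sum(A - 2 * s * A) / p
    loss[r] = np.sum((theta_hat - theta) ** 2) / p

# Unbiasedness: the two averages agree up to Monte Carlo error.
print(ure.mean(), loss.mean())
```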

Proof for the case of Model I.

The uniform \(L^{2}\) convergence of \(\left (\mathrm{I}\right )\) and \(\left (\mathrm{II}\right )\) has been shown in Theorem 3.1 of [29] under our assumptions \(\left (\mathrm{A}\right )\) and \(\left (\mathrm{B}\right )\), so we focus on \(\left (\mathrm{III}\right )\); i.e., we want to show that \(\sup \limits _{0\leq \lambda \leq \infty,\;\boldsymbol{\mu }\in \mathcal{L}}\left \vert \left (\mathrm{III}\right )\right \vert \rightarrow 0\) in \(L^{1}\) as \(p \rightarrow \infty\).

Without loss of generality, let us assume \(A_{1} \leq A_{2} \leq \cdots \leq A_{p}\). We have

$$\displaystyle\begin{array}{rcl} \sup \limits _{0\leq \lambda \leq \infty,\;\boldsymbol{\mu }\in \mathcal{L}}\left \vert \left (\mathrm{III}\right )\right \vert & =& \dfrac{2} {p}\sup \limits _{0\leq \lambda \leq \infty,\;\boldsymbol{\mu }\in \mathcal{L}}\left \vert \sum _{i=1}^{p} \dfrac{A_{i}} {A_{i}+\lambda }\mu _{i}\left (Y _{i} -\theta _{i}\right )\right \vert {}\\ & \leq & \dfrac{2} {p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\sup \limits _{0\leq c_{1}\leq \cdots \leq c_{p}\leq 1}\left \vert \sum _{i=1}^{p}c_{ i}\mu _{i}\left (Y _{i} -\theta _{i}\right )\right \vert {}\\ & =& \dfrac{2} {p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\max \limits _{1\leq j\leq p}\left \vert \sum _{i=j}^{p}\mu _{ i}\left (Y _{i} -\theta _{i}\right )\right \vert, {}\\ \end{array}$$

where the last equality follows from Lemma 2.1 of [13]. For a generic \(p\)-dimensional vector \(\boldsymbol{v}\), we denote \([\boldsymbol{v}]_{j:p} = (0,\ldots,0,v_{j},v_{j+1},\ldots,v_{p})\). Let \(\boldsymbol{P}_{\boldsymbol{X}} =\boldsymbol{ X}^{T}\left (\boldsymbol{XX}^{T}\right )^{-1}\boldsymbol{X}\) be the projection matrix onto \(\mathcal{L}_{\mathrm{row}}\left (\boldsymbol{X}\right )\), the row space of \(\boldsymbol{X}\). Then, since \(\mathcal{L}\subset \mathcal{L}_{\mathrm{row}}\left (\boldsymbol{X}\right )\), we have

$$\displaystyle\begin{array}{rcl} & & \dfrac{2} {p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\max \limits _{1\leq j\leq p}\left \vert \sum _{i=j}^{p}\mu _{ i}\left (Y _{i} -\theta _{i}\right )\right \vert = \dfrac{2} {p}\max \limits _{1\leq j\leq p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\left \vert \boldsymbol{\mu }^{T}[\boldsymbol{Y }-\boldsymbol{\theta }]_{ j:p}\right \vert {}\\ & =& \dfrac{2} {p}\max \limits _{1\leq j\leq p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\left \vert \boldsymbol{\mu }^{T}\boldsymbol{P}_{\boldsymbol{ X}}[\boldsymbol{Y }-\boldsymbol{\theta }]_{j:p}\right \vert \leq \dfrac{2} {p}\max \limits _{1\leq j\leq p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\left \Vert \boldsymbol{\mu }\right \Vert \times \left \Vert \boldsymbol{P}_{\boldsymbol{X}}[\boldsymbol{Y }-\boldsymbol{\theta }]_{j:p}\right \Vert {}\\ & =& \dfrac{2} {p}\max \limits _{1\leq j\leq p}Mp^{\kappa }\left \Vert \boldsymbol{Y }\right \Vert \times \left \Vert \boldsymbol{P}_{\boldsymbol{X}}[\boldsymbol{Y }-\boldsymbol{\theta }]_{j:p}\right \Vert. {}\\ \end{array}$$

The Cauchy-Schwarz inequality thus gives

$$\displaystyle{ \mathbb{E}\left (\sup \limits _{0\leq \lambda \leq \infty,\boldsymbol{\mu }\in \mathcal{L}}\left \vert \left (\mathrm{III}\right )\right \vert \right ) \leq 2Mp^{\kappa -1}\sqrt{\mathbb{E}\left (\left \Vert \boldsymbol{Y } \right \Vert ^{2 } \right )} \times \sqrt{\mathbb{E}\left (\max \limits _{ 1\leq j\leq p}\left \Vert \boldsymbol{P}_{\boldsymbol{X}}[\boldsymbol{Y }-\boldsymbol{\theta }]_{j:p}\right \Vert ^{2}\right )}. }$$
(20)

It is straightforward to see that, by conditions (A) and (C),

$$\displaystyle{\sqrt{\mathbb{E}\left (\left \Vert \boldsymbol{Y } \right \Vert ^{2 } \right )} = \sqrt{\mathbb{E}( \sum \nolimits _{i=1 }^{p }Y _{i }^{2 })} = \sqrt{\sum \nolimits _{i=1 }^{p }\left (\theta _{i }^{2 } + A_{i } \right )} = O\left (p^{1/2}\right ).}$$

For the second term on the right-hand side of (20), let \(\boldsymbol{P}_{\boldsymbol{X}} =\boldsymbol{\varGamma D\varGamma }^{T}\) denote the spectral decomposition. Clearly,

$$\displaystyle{\boldsymbol{D} =\mathrm{ diag}\left (\mathop{\mathop{\underbrace{1,\ldots,1}}\limits }\limits_{k\text{ copies}},\mathop{\mathop{\underbrace{0,\ldots,0}}\limits }\limits_{p - k\text{ copies}}\right ).}$$

It follows that

$$\displaystyle\begin{array}{rcl} & & \mathbb{E}\left (\max \limits _{1\leq j\leq p}\left \Vert \boldsymbol{P}_{\boldsymbol{X}}[\boldsymbol{Y }-\boldsymbol{\theta }]_{j:p}\right \Vert ^{2}\right ) = \mathbb{E}\left (\max \limits _{ 1\leq j\leq p}[\boldsymbol{Y }-\boldsymbol{\theta }]_{j:p}^{T}\boldsymbol{P}_{\boldsymbol{ X}}[\boldsymbol{Y }-\boldsymbol{\theta }]_{j:p}\right ) {}\\ & & = \mathbb{E}\left (\max \limits _{1\leq j\leq p}\mathrm{tr}\left (\boldsymbol{D\varGamma }^{T}[\boldsymbol{Y }-\boldsymbol{\theta }]_{ j:p}\left (\boldsymbol{\varGamma }^{T}[\boldsymbol{Y }-\boldsymbol{\theta }]_{ j:p}\right )^{T}\right )\right ) {}\\ & & = \mathbb{E}\left (\max \limits _{1\leq j\leq p}\sum _{l=1}^{k}\left [\boldsymbol{\varGamma }^{T}[\boldsymbol{Y }-\boldsymbol{\theta }]_{ j:p}\right ]_{l}^{2}\right ) {}\\ & & = \mathbb{E}\left (\max \limits _{1\leq j\leq p}\sum _{l=1}^{k}\left (\sum _{ m=j}^{p}\left [\boldsymbol{\varGamma }^{T}\right ]_{ lm}\left (Y _{m} -\theta _{m}\right )\right )^{2}\right ) {}\\ & & \leq \mathbb{E}\left (\sum _{l=1}^{k}\max \limits _{ 1\leq j\leq p}\left (\sum _{m=j}^{p}\left [\boldsymbol{\varGamma }^{T}\right ]_{ lm}\left (Y _{m} -\theta _{m}\right )\right )^{2}\right ) {}\\ & & =\sum _{ l=1}^{k}\mathbb{E}\left (\max \limits _{ 1\leq j\leq p}\left (\sum _{m=j}^{p}\left [\boldsymbol{\varGamma }^{T}\right ]_{ lm}\left (Y _{m} -\theta _{m}\right )\right )^{2}\right ). {}\\ \end{array}$$

For each \(l\), \(M_{j}^{\left (l\right )} =\sum _{ m=p-j+1}^{p}\left [\boldsymbol{\varGamma }^{T}\right ]_{lm}\left (Y _{m} -\theta _{m}\right )\) forms a martingale in \(j\), so by Doob's \(L^{p}\) maximal inequality,

$$\displaystyle\begin{array}{rcl} \mathbb{E}\left (\max \limits _{1\leq j\leq p}\left (M_{j}^{\left (l\right )}\right )^{2}\right )& \leq & 4\mathbb{E}\left (M_{ p}^{\left (l\right )}\right )^{2} = 4\mathbb{E}\left (\sum _{ m=1}^{p}\left [\boldsymbol{\varGamma }^{T}\right ]_{ lm}\left (Y _{m} -\theta _{m}\right )\right )^{2} {}\\ & =& 4\sum _{m=1}^{p}\left [\boldsymbol{\varGamma }^{T}\right ]_{ lm}^{2}A_{ m} = 4\left [\boldsymbol{\varGamma }^{T}\boldsymbol{A\varGamma }\right ]_{ ll}. {}\\ \end{array}$$
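As an aside, the maximal inequality just invoked can be illustrated numerically. Here is a toy Monte Carlo sketch of ours, with the weights \(\left [\boldsymbol{\varGamma }^{T}\right ]_{lm}\) set to 1 for simplicity, showing \(\mathbb{E}\left (\max _{j}M_{j}^{2}\right ) \leq 4\,\mathbb{E}\left (M_{p}^{2}\right )\) for Gaussian partial sums:

```python
import numpy as np

rng = np.random.default_rng(5)
p, reps = 200, 5000

A = rng.uniform(0.5, 2.0, size=p)                  # illustrative variances
inc = rng.standard_normal((reps, p)) * np.sqrt(A)  # independent increments Y_m - theta_m
M = np.cumsum(inc, axis=1)                         # martingale of partial sums

lhs = (M ** 2).max(axis=1).mean()                  # E[max_j M_j^2]
rhs = 4 * (M[:, -1] ** 2).mean()                   # 4 E[M_p^2]
print(lhs, rhs)                                    # lhs <= rhs up to Monte Carlo error
```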

Therefore,

$$\displaystyle\begin{array}{rcl} & & \mathbb{E}\left (\max \limits _{1\leq j\leq p}\left \Vert \boldsymbol{P}_{\boldsymbol{X}}[\boldsymbol{Y }-\boldsymbol{\theta }]_{j:p}\right \Vert ^{2}\right ) \leq \sum _{ l=1}^{k}4\left [\boldsymbol{\varGamma }^{T}\boldsymbol{A\varGamma }\right ]_{ ll} {}\\ & & = 4\sum _{l=1}^{p}\left [\boldsymbol{D}\right ]_{ ll}\left [\boldsymbol{\varGamma }^{T}\boldsymbol{A\varGamma }\right ]_{ ll} = 4\ \mathrm{tr}\left (\boldsymbol{D\varGamma }^{T}\boldsymbol{A\varGamma }\right ) = 4\ \mathrm{tr}\left (\boldsymbol{P}_{\boldsymbol{ X}}\boldsymbol{A}\right ) {}\\ & & = 4\ \mathrm{tr}\left (\boldsymbol{X}^{T}\left (\boldsymbol{XX}^{T}\right )^{-1}\boldsymbol{XA}\right ) = 4\ \mathrm{tr}\left (\left (\boldsymbol{XX}^{T}\right )^{-1}\boldsymbol{XAX}^{T}\right ) = O\left (1\right ), {}\\ \end{array}$$

where the last equality uses conditions \(\left (\mathrm{D}\right )\) and \(\left (\mathrm{E}\right )\). We finally obtain

$$\displaystyle{\mathbb{E}\left (\sup \limits _{0\leq \lambda \leq \infty,\;\boldsymbol{\mu }\in \mathcal{L}}\left \vert \left (\mathrm{III}\right )\right \vert \right ) \leq o\left (p^{-1/2}\right ) \times O\left (p^{1/2}\right ) \times O\left (1\right ) = o\left (1\right ).}$$

Proof for the case of Model II.

Under Model II, we know that

$$\displaystyle{\sum _{i=1}^{p}A_{ i}\theta _{i}^{2} =\boldsymbol{\theta } ^{T}\boldsymbol{A\theta } =\boldsymbol{\beta } ^{T}(\boldsymbol{XAX}^{T})\boldsymbol{\beta } = O\left (p\right )}$$

by condition \(\left (\mathrm{D}\right )\). In other words, condition \(\left (\mathrm{D}\right )\) implies condition \(\left (\mathrm{B}\right )\). Therefore, we know that the term \(\left (\mathrm{I}\right ) \rightarrow 0\) in \(L^{2}\), as shown in Theorem 3.1 of [29], and we only need to show the uniform \(L^{1}\) convergence of the other two terms, \(\left (\mathrm{II}\right )\) and \(\left (\mathrm{III}\right )\).

Recall that \(\boldsymbol{B} \in \mathcal{B} = \left \{\lambda \boldsymbol{X}^{T}\boldsymbol{WX}:\lambda > 0\right \}\) has rank only \(k\) under Model II. We can re-express \(\left (\mathrm{II}\right )\) and \(\left (\mathrm{III}\right )\) in terms of low-rank matrices. Let \(\boldsymbol{V } = \left (\boldsymbol{XA}^{-1}\boldsymbol{X}^{T}\right )^{-1}\). The Woodbury formula gives

$$\displaystyle\begin{array}{rcl} \left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}& =& \left (\boldsymbol{A} +\lambda \boldsymbol{ X}^{T}\boldsymbol{WX}\right )^{-1} =\boldsymbol{ A}^{-1} -\boldsymbol{ A}^{-1}\lambda \boldsymbol{X}^{T}\left (\boldsymbol{W}^{-1} +\lambda \boldsymbol{ V }^{-1}\right )^{-1}\boldsymbol{XA}^{-1} {}\\ & =& \boldsymbol{A}^{-1} -\boldsymbol{ A}^{-1}\lambda \boldsymbol{X}^{T}\boldsymbol{W}\left (\lambda \boldsymbol{W} +\boldsymbol{ V }\right )^{-1}\boldsymbol{V XA}^{-1}, {}\\ \end{array}$$

which tells us

$$\displaystyle{\boldsymbol{B}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1} =\boldsymbol{ I}_{ p} -\boldsymbol{ A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1} =\lambda \boldsymbol{ X}^{T}\boldsymbol{W}\left (\lambda \boldsymbol{W} +\boldsymbol{ V }\right )^{-1}\boldsymbol{V XA}^{-1}.}$$

Let \(\boldsymbol{U}\boldsymbol{\varLambda U}^{T}\) be the spectral decomposition of \(\boldsymbol{W}^{-1/2}\boldsymbol{V W}^{-1/2}\), i.e., \(\boldsymbol{W}^{-1/2}\boldsymbol{V W}^{-1/2} =\boldsymbol{ U\varLambda }\boldsymbol{U}^{T}\), where \(\boldsymbol{\varLambda }=\mathrm{ diag}\left (d_{1},\ldots,d_{k}\right )\) with \(d_{1} \leq \cdots \leq d_{k}\). Then \(\left (\lambda \boldsymbol{W} +\boldsymbol{ V }\right )^{-1} =\boldsymbol{ W}^{-1/2}\left (\lambda \boldsymbol{I}_{k} +\boldsymbol{ W}^{-1/2}\boldsymbol{V W}^{-1/2}\right )^{-1}\boldsymbol{W}^{-1/2} =\boldsymbol{ W}^{-1/2}\boldsymbol{U}\left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{U}^{T}\boldsymbol{W}^{-1/2}\), from which we obtain

$$\displaystyle{\boldsymbol{B}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1} =\lambda \boldsymbol{ X}^{T}\boldsymbol{W}\left (\lambda \boldsymbol{W} +\boldsymbol{ V }\right )^{-1}\boldsymbol{V XA}^{-1} =\lambda \boldsymbol{ X}^{T}\boldsymbol{W}^{1/2}\boldsymbol{U}\left (\lambda \boldsymbol{I}_{ k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda U}^{T}\boldsymbol{W}^{1/2}\boldsymbol{XA}^{-1}.}$$

If we denote \(\boldsymbol{Z} =\boldsymbol{ U}^{T}\boldsymbol{W}^{1/2}\boldsymbol{X}\), i.e., \(\boldsymbol{Z}\) is the transformed covariate matrix, then \(\boldsymbol{B}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1} =\lambda \boldsymbol{ Z}^{T}\left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\). It follows that

$$\displaystyle\begin{array}{rcl} \left (\mathrm{II}\right )& =& -\dfrac{2} {p}\mathrm{tr}\left (\boldsymbol{B}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\left (\boldsymbol{Y Y }^{T} -\boldsymbol{ Y \theta }^{T} -\boldsymbol{ A}\right )\right ) {}\\ & =& -\dfrac{2} {p}\mathrm{tr}\left (\lambda \boldsymbol{Z}^{T}\left (\lambda \boldsymbol{I}_{ k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\left (\boldsymbol{Y Y }^{T} -\boldsymbol{ Y \theta }^{T} -\boldsymbol{ A}\right )\right ) {}\\ & =& -\dfrac{2} {p}\mathrm{tr}\left (\lambda \left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\left (\boldsymbol{Y Y }^{T} -\boldsymbol{ Y \theta }^{T} -\boldsymbol{ A}\right )\boldsymbol{Z}^{T}\right ), {}\\ \left (\mathrm{III}\right )& =& -\dfrac{2} {p}\mathrm{tr}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{\mu }\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\right ) {}\\ & =& -\dfrac{2} {p}\mathrm{tr}\left (\left (\boldsymbol{I}_{p} -\lambda \boldsymbol{ Z}^{T}\left (\lambda \boldsymbol{I}_{ k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\right )\boldsymbol{\mu }\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\right ) {}\\ & =& -\dfrac{2} {p}\mathrm{tr}\left (\boldsymbol{\mu }\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\right ) + \dfrac{2} {p}\mathrm{tr}\left (\lambda \left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\boldsymbol{\mu }\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\boldsymbol{Z}^{T}\right ) {}\\ & =& \left (\mathrm{III}\right )_{1} + \left (\mathrm{III}\right )_{2}. {}\\ \end{array}$$
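The low-rank identity used here, \(\boldsymbol{B}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1} =\lambda \boldsymbol{ Z}^{T}\left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\), can be verified numerically against the direct \(p \times p\) computation. A minimal numpy sketch of ours, with arbitrary illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
p, k, lam = 8, 3, 0.7

# Illustrative ingredients: X is k x p, A diagonal, W symmetric positive definite.
X = rng.standard_normal((k, p))
A = np.diag(rng.uniform(0.5, 2.0, size=p))
G = rng.standard_normal((k, k))
W = G @ G.T + k * np.eye(k)

B = lam * X.T @ W @ X                  # rank-k prior covariance
direct = B @ np.linalg.inv(A + B)      # direct p x p computation

# Low-rank route: V = (X A^{-1} X^T)^{-1}, W^{-1/2} V W^{-1/2} = U diag(d) U^T,
# Z = U^T W^{1/2} X, and B(A+B)^{-1} = Z^T diag(lam*d/(lam+d)) Z A^{-1}.
V = np.linalg.inv(X @ np.linalg.inv(A) @ X.T)
w, Q = np.linalg.eigh(W)
W_half = Q @ np.diag(np.sqrt(w)) @ Q.T          # symmetric square root of W
M = np.linalg.inv(W_half) @ V @ np.linalg.inv(W_half)
d, U = np.linalg.eigh(M)
Z = U.T @ W_half @ X
low_rank = Z.T @ np.diag(lam * d / (lam + d)) @ Z @ np.linalg.inv(A)

assert np.allclose(direct, low_rank)
```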

We will next show that \(\left (\mathrm{II}\right )\), \(\left (\mathrm{III}\right )_{1}\), and \(\left (\mathrm{III}\right )_{2}\) all converge uniformly to zero in \(L^{1}\), which will then complete our proof.

Let \(\boldsymbol{\varXi }=\boldsymbol{ ZA}^{-1}\left (\boldsymbol{Y Y }^{T} -\boldsymbol{ Y \theta }^{T} -\boldsymbol{ A}\right )\boldsymbol{Z}^{T}\). Then

$$\displaystyle\begin{array}{rcl} \sup \limits _{0\leq \lambda \leq \infty }\left \vert \left (\mathrm{II}\right )\right \vert & =& \dfrac{2} {p}\sup \limits _{0\leq \lambda \leq \infty }\left \vert \sum _{i=1}^{k} \dfrac{\lambda d_{i}} {\lambda +d_{i}}\left [\boldsymbol{\varXi }\right ]_{ii}\right \vert {}\\ & \leq & \dfrac{2} {p}\sup \limits _{0\leq c_{1}\leq \cdots \leq c_{k}\leq d_{k}}\left \vert \sum _{i=1}^{k}c_{ i}\left [\boldsymbol{\varXi }\right ]_{ii}\right \vert = \dfrac{2} {p}\max \limits _{1\leq j\leq k}\left \vert \sum _{i=j}^{k}d_{ k}\left [\boldsymbol{\varXi }\right ]_{ii}\right \vert, {}\\ \end{array}$$

where the last equality follows as in Lemma 2.1 of [13]. Since there are only finitely many terms in the summation and the maximization, it suffices to show that

$$\displaystyle{d_{k}\left [\boldsymbol{\varXi }\right ]_{ii}/p \rightarrow 0\text{ in }L^{2}\ \ \ \ \text{for all }1 \leq i \leq k.}$$

To establish this, we note that \(\left [\boldsymbol{\varXi }\right ]_{ii} =\sum _{ n=1}^{p}\sum _{m=1}^{p}\left (A_{n}^{-1}Y _{n}\left (Y _{m} -\theta _{m}\right ) -\delta _{nm}\right )\left [\boldsymbol{Z}\right ]_{in}\left [\boldsymbol{Z}\right ]_{im}\), so that

$$\displaystyle\begin{array}{rcl} \mathbb{E}\left (\left [\boldsymbol{\varXi }\right ]_{ii}^{2}\right )& =& \sum _{ n,m,n^{{\prime}},m^{{\prime}}}\mathbb{E}\left (\left (A_{n}^{-1}Y _{ n}\left (Y _{m} -\theta _{m}\right ) -\delta _{nm}\right )\left (A_{n^{{\prime}}}^{-1}Y _{ n^{{\prime}}}\left (Y _{m^{{\prime}}}-\theta _{m^{{\prime}}}\right ) -\delta _{n^{{\prime}}m^{{\prime}}}\right )\right ) {}\\ & & \times \left [\boldsymbol{Z}\right ]_{in}\left [\boldsymbol{Z}\right ]_{im}\left [\boldsymbol{Z}\right ]_{in^{{\prime}}}\left [\boldsymbol{Z}\right ]_{im^{{\prime}}}. {}\\ \end{array}$$

Depending on whether \(n\), \(m\), \(n^{{\prime}}\), \(m^{{\prime}}\) take the same or distinct values, we can break the summation into 15 disjoint cases:

$$\displaystyle\begin{array}{rcl} & & \sum _{\text{all distinct}} +\sum _{\text{three distinct, }n=m} +\sum _{\text{three distinct, }n=n^{{\prime}}} +\sum _{\text{three distinct, }n=m^{{\prime}}} {}\\ & +& \sum _{\text{three distinct, }m=n^{{\prime}}} +\sum _{\text{three distinct, }m=m^{{\prime}}} +\sum _{\text{three distinct, }n^{{\prime}}=m^{{\prime}}} +\sum _{\text{two distinct, }n=m\text{, }n^{{\prime}}=m^{{\prime}}} {}\\ & +& \sum _{\text{two distinct, }n=n^{{\prime}}\text{, }m=m^{{\prime}}} +\sum _{\text{two distinct, }n=m^{{\prime}}\text{, }n^{{\prime}}=m} +\sum _{\text{two distinct, }n=m=n^{{\prime}}} +\sum _{\text{two distinct, }n=m=m^{{\prime}}} {}\\ & +& \sum _{\text{two distinct, }n=n^{{\prime}}=m^{{\prime}}} +\sum _{\text{two distinct, }m=n^{{\prime}}=m^{{\prime}}} +\sum _{n=m=n^{{\prime}}=m^{{\prime}}}. {}\\ \end{array}$$

Many terms are zero. Straightforward evaluation of each summation gives

$$\displaystyle\begin{array}{rcl} \mathbb{E}\left (\left [\boldsymbol{\varXi }\right ]_{ii}^{2}\right )& =& \sum _{ n=1}^{p}\mathbb{E}\left (\left (A_{ n}^{-1}Y _{ n}\left (Y _{n} -\theta _{n}\right ) - 1\right )^{2}\right )\left [\boldsymbol{Z}\right ]_{ in}^{4} {}\\ & & +\sum _{n=1}^{p}\sum _{ m\neq n}\mathbb{E}\left (\left (A_{n}^{-1}Y _{ n}\left (Y _{m} -\theta _{m}\right )\right )^{2}\right )\left [\boldsymbol{Z}\right ]_{ in}^{2}\left [\boldsymbol{Z}\right ]_{ im}^{2} {}\\ & & +\sum _{n=1}^{p}\sum _{ m\neq n}\mathbb{E}\left (\left (A_{n}^{-1}Y _{ n}\left (Y _{m} -\theta _{m}\right )\right )\left (A_{m}^{-1}Y _{ m}\left (Y _{n} -\theta _{n}\right )\right )\right )\left [\boldsymbol{Z}\right ]_{in}^{2}\left [\boldsymbol{Z}\right ]_{ im}^{2} {}\\ & & +2\sum _{n=1}^{p}\sum _{ m\neq n}\mathbb{E}\left (\left (A_{n}^{-1}Y _{ n}\left (Y _{n} -\theta _{n}\right ) - 1\right )\left (A_{m}^{-1}Y _{ m}\left (Y _{n} -\theta _{n}\right )\right )\right )\left [\boldsymbol{Z}\right ]_{in}^{3}\left [\boldsymbol{Z}\right ]_{ im} {}\\ & & +\sum _{n=1}^{p}\sum _{ m\neq n^{{\prime}},n^{{\prime}}\neq n,m\neq n}\mathbb{E}\left (\left (A_{m}^{-1}Y _{ m}\left (Y _{n} -\theta _{n}\right )\right )\left (A_{n^{{\prime}}}^{-1}Y _{ n^{{\prime}}}\left (Y _{n} -\theta _{n}\right )\right )\right )\left [\boldsymbol{Z}\right ]_{in}^{2}\left [\boldsymbol{Z}\right ]_{ im}\left [\boldsymbol{Z}\right ]_{in^{{\prime}}} {}\\ & =& \sum _{n=1}^{p}\dfrac{2A_{n} +\theta _{ n}^{2}} {A_{n}} \left [\boldsymbol{Z}\right ]_{in}^{4} +\sum _{ n=1}^{p}\sum _{ m\neq n}\dfrac{A_{n}A_{m} + A_{n}\theta _{m}^{2}} {A_{m}^{2}} \left [\boldsymbol{Z}\right ]_{in}^{2}\left [\boldsymbol{Z}\right ]_{ im}^{2} +\sum _{ n=1}^{p}\sum _{ m\neq n}\left [\boldsymbol{Z}\right ]_{in}^{2}\left [\boldsymbol{Z}\right ]_{ im}^{2} {}\\ & & +2\sum _{n=1}^{p}\sum _{ m\neq n} \dfrac{\theta _{n}\theta _{m}} {A_{m}}\left [\boldsymbol{Z}\right ]_{in}^{3}\left [\boldsymbol{Z}\right ]_{ im} +\sum _{ n=1}^{p}\sum _{ m\neq n^{{\prime}},n^{{\prime}}\neq n,m\neq n} \dfrac{A_{n}\theta _{m}\theta _{n^{{\prime}}}} {A_{m}A_{n^{{\prime}}}}\left [\boldsymbol{Z}\right ]_{in}^{2}\left [\boldsymbol{Z}\right ]_{ im}\left [\boldsymbol{Z}\right ]_{in^{{\prime}}} {}\\ & =& \sum _{n,m=1}^{p} \dfrac{A_{n}} {A_{m}}\left [\boldsymbol{Z}\right ]_{in}^{2}\left [\boldsymbol{Z}\right ]_{ im}^{2} +\sum _{ n,m=1}^{p}\left [\boldsymbol{Z}\right ]_{ in}^{2}\left [\boldsymbol{Z}\right ]_{ im}^{2} +\sum _{ n,m,n^{{\prime}}=1}^{p} \dfrac{A_{n}\theta _{m}\theta _{n^{{\prime}}}} {A_{m}A_{n^{{\prime}}}}\left [\boldsymbol{Z}\right ]_{in}^{2}\left [\boldsymbol{Z}\right ]_{ im}\left [\boldsymbol{Z}\right ]_{in^{{\prime}}}.{}\\ \end{array}$$

Using matrix notation, we can re-express the above equation as

$$\displaystyle\begin{array}{rcl} \mathbb{E}\left (\left [\boldsymbol{\varXi }\right ]_{ii}^{2}\right )& =& \left [\boldsymbol{ZAZ}^{T}\right ]_{ ii}\left [\boldsymbol{ZA}^{-1}\boldsymbol{Z}^{T}\right ]_{ ii} + \left [\boldsymbol{ZZ}^{T}\right ]_{ ii}^{2} + \left [\boldsymbol{ZAZ}^{T}\right ]_{ ii}\left [\boldsymbol{ZA}^{-1}\boldsymbol{\theta }\right ]_{ i}^{2} {}\\ & \leq & \mathrm{tr}\left (\boldsymbol{ZAZ}^{T}\right )\mathrm{tr}\left (\boldsymbol{ZA}^{-1}\boldsymbol{Z}^{T}\right ) +\mathrm{ tr}\left (\boldsymbol{ZZ}^{T}\right )^{2} +\mathrm{ tr}\left (\boldsymbol{ZAZ}^{T}\right )\mathrm{tr}\left (\boldsymbol{\theta }^{T}\boldsymbol{A}^{-1}\boldsymbol{Z}^{T}\boldsymbol{ZA}^{-1}\boldsymbol{\theta }\right ) {}\\ & =& \mathrm{tr}\left (\boldsymbol{WXAX}^{T}\right )\mathrm{tr}\left (\boldsymbol{WXA}^{-1}\boldsymbol{X}^{T}\right ) +\mathrm{ tr}\left (\boldsymbol{WXX}^{T}\right )^{2} {}\\ & & +\mathrm{tr}\left (\boldsymbol{WXAX}^{T}\right )\mathrm{tr}\left (\boldsymbol{\beta }^{T}\left (\boldsymbol{XA}^{-1}\boldsymbol{X}^{T}\right )\boldsymbol{W}\left (\boldsymbol{XA}^{-1}\boldsymbol{X}^{T}\right )\boldsymbol{\beta }\right ), {}\\ \end{array}$$

which is \(O\left (p\right )O\left (p\right ) + O\left (p\right )^{2} + O\left (p\right )O\left (p^{2}\right ) = O\left (p^{3}\right )\) by conditions \(\left (\mathrm{D}\right )\)-\(\left (\mathrm{F}\right )\). Note also that condition \(\left (\mathrm{F}\right )\) implies

$$\displaystyle{d_{k} \leq \sum _{i=1}^{k}d_{ i} =\mathrm{ tr}\left (\boldsymbol{W}^{-1/2}\boldsymbol{V W}^{-1/2}\right ) =\mathrm{ tr}\left (\boldsymbol{W}^{-1}\boldsymbol{V }\right ) =\mathrm{ tr}\left (\boldsymbol{W}^{-1}(\boldsymbol{XA}^{-1}\boldsymbol{X}^{T})^{-1}\right ) = O\left (p^{-1}\right ).}$$

Therefore, we have

$$\displaystyle{\mathbb{E}\left (d_{k}^{2}\left [\boldsymbol{\varXi }\right ]_{ ii}^{2}/p^{2}\right ) = O\left (p^{-2}\right )O\left (p^{3}\right )/p^{2} = O\left (p^{-1}\right ) \rightarrow 0,}$$

which proves

$$\displaystyle{\sup \limits _{0\leq \lambda \leq \infty }\left \vert \left (\mathrm{II}\right )\right \vert \rightarrow 0\text{ in }L^{2},\ \ \ \ \text{as }p \rightarrow \infty.}$$

To prove the uniform convergence of \(\left (\mathrm{III}\right )_{1}\) to zero in \(L^{1}\), we note that

$$\displaystyle\begin{array}{rcl} \sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\left \vert \left (\mathrm{III}\right )_{1}\right \vert & =& \dfrac{2} {p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\left \vert \boldsymbol{\mu }^{T}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\right \vert = \dfrac{2} {p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\left \vert \boldsymbol{\mu }^{T}\boldsymbol{P}_{\boldsymbol{ X}}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\right \vert {}\\ & \leq & \dfrac{2} {p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\left \Vert \boldsymbol{\mu }\right \Vert \times \left \Vert \boldsymbol{P}_{\boldsymbol{X}}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\right \Vert = \dfrac{2} {p}Mp^{\kappa }\left \Vert \boldsymbol{Y }\right \Vert \times \left \Vert \boldsymbol{P}_{\boldsymbol{X}}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\right \Vert, {}\\ \end{array}$$

so by the Cauchy-Schwarz inequality,

$$\displaystyle{ \mathbb{E}\left (\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\left \vert \left (\mathrm{III}\right )_{1}\right \vert \right ) \leq 2Mp^{\kappa -1}\sqrt{\mathbb{E}\left (\left \Vert \boldsymbol{Y } \right \Vert ^{2 } \right )}\sqrt{\mathbb{E}\left (\left \Vert \boldsymbol{P}_{\boldsymbol{ X}}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\right \Vert ^{2}\right )}. }$$
(21)

Under Model II, \(\boldsymbol{\theta }=\boldsymbol{ X}^{T}\boldsymbol{\beta }\), so it follows that \(\sum _{i=1}^{p}\theta _{i}^{2} = \left \Vert \boldsymbol{\theta }\right \Vert ^{2} =\mathrm{ tr}\left (\boldsymbol{\beta \beta }^{T}\boldsymbol{XX}^{T}\right ) = O\left (p\right )\) by condition \(\left (\mathrm{E}\right )\). Hence \(\sqrt{\mathbb{E}\left (\left \Vert \boldsymbol{Y } \right \Vert ^{2 } \right )} = \sqrt{\sum \nolimits _{i=1 }^{p }\left (\theta _{i }^{2 } + A_{i } \right )} = O\left (p^{1/2}\right )\). For the second term on the right-hand side of (21), note that

$$\displaystyle\begin{array}{rcl} \mathbb{E}\left (\left \Vert \boldsymbol{P}_{\boldsymbol{X}}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\right \Vert ^{2}\right )& =& \mathbb{E}\left (\mathrm{tr}\left (\boldsymbol{P}_{\boldsymbol{ X}}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\right )\right ) {}\\ & =& \mathrm{tr}\left (\boldsymbol{P}_{\boldsymbol{X}}\boldsymbol{A}\right ) =\mathrm{ tr}\left (\left (\boldsymbol{XX}^{T}\right )^{-1}\boldsymbol{XAX}^{T}\right ) = O\left (1\right ) {}\\ \end{array}$$

by conditions \(\left (\mathrm{D}\right )\) and \(\left (\mathrm{E}\right )\). Thus, in aggregate, we have

$$\displaystyle{\mathbb{E}\left (\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\left \vert \left (\mathrm{III}\right )_{1}\right \vert \right ) \leq 2Mp^{\kappa -1}O\left (p^{1/2}\right )O\left (1\right ) = o\left (1\right ).}$$

We finally consider the \(\left (\mathrm{III}\right )_{2}\) term. We have

$$\displaystyle\begin{array}{rcl} \sup \limits _{0\leq \lambda \leq \infty,\;\boldsymbol{\mu }\in \mathcal{L}}\left \vert \left (\mathrm{III}\right )_{2}\right \vert & =& \dfrac{2} {p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\sup \limits _{0\leq \lambda \leq \infty }\left \vert \sum _{i=1}^{k} \dfrac{\lambda d_{i}} {\lambda +d_{i}}\left [\boldsymbol{ZA}^{-1}\boldsymbol{\mu }\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\boldsymbol{Z}^{T}\right ]_{ ii}\right \vert {}\\ & \leq & \dfrac{2} {p}\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\max \limits _{1\leq j\leq k}\left \vert \sum _{i=j}^{k}d_{ k}\left [\boldsymbol{ZA}^{-1}\boldsymbol{\mu }\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\boldsymbol{Z}^{T}\right ]_{ ii}\right \vert {}\\ & \leq & \dfrac{2d_{k}} {p} \sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\sum _{i=1}^{k}\left \vert \left [\boldsymbol{ZA}^{-1}\boldsymbol{\mu }\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\boldsymbol{Z}^{T}\right ]_{ ii}\right \vert {}\\ & =& \dfrac{2d_{k}} {p} \sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\sum _{i=1}^{k}\left \vert \left [\boldsymbol{ZA}^{-1}\boldsymbol{\mu }\right ]_{ i}\left [\boldsymbol{Z}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\right ]_{i}\right \vert {}\\ & \leq & \dfrac{2d_{k}} {p} \sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\sqrt{\sum _{i=1 }^{k }\left [\boldsymbol{ZA}^{-1 } \boldsymbol{\mu } \right ] _{i }^{2}} \times \sqrt{\sum _{i=1 }^{k }\left [\boldsymbol{Z}\left (\boldsymbol{Y }-\boldsymbol{\theta } \right ) \right ] _{i }^{2}}. {}\\ \end{array}$$

Thus, by the Cauchy-Schwarz inequality,

$$\displaystyle{\mathbb{E}\left (\sup \limits _{0\leq \lambda \leq \infty,\;\boldsymbol{\mu }\in \mathcal{L}}\left \vert \left (\mathrm{III}\right )_{2}\right \vert \right ) \leq \dfrac{2d_{k}} {p} \sqrt{\mathbb{E}\left (\sup \limits _{\boldsymbol{\mu }\in \mathcal{L} } \sum _{i=1 }^{k }\left [\boldsymbol{ZA}^{-1 } \boldsymbol{\mu } \right ] _{i }^{2 } \right )} \times \sqrt{\mathbb{E}\left (\sum _{i=1 }^{k }\left [\boldsymbol{Z}\left (\boldsymbol{Y }-\boldsymbol{\theta } \right ) \right ] _{i }^{2 } \right )}.}$$

Note that

$$\displaystyle\begin{array}{rcl} & & \sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\sum _{i=1}^{k}\left [\boldsymbol{ZA}^{-1}\boldsymbol{\mu }\right ]_{ i}^{2} =\sup \limits _{\boldsymbol{\mu } \in \mathcal{L}}\sum _{i=1}^{k}\left (\sum _{ m=1}^{p}\left [\boldsymbol{ZA}^{-1}\right ]_{ im}\left [\boldsymbol{\mu }\right ]_{m}\right )^{2} {}\\ & & \leq \sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\sum _{i=1}^{k}\left (\sum _{ m=1}^{p}\left [\boldsymbol{ZA}^{-1}\right ]_{ im}^{2} \times \sum _{ m=1}^{p}\left [\boldsymbol{\mu }\right ]_{ m}^{2}\right ) =\sup \limits _{\boldsymbol{\mu } \in \mathcal{L}}\sum _{i=1}^{k}\left (\left [\boldsymbol{ZA}^{-2}\boldsymbol{Z}^{T}\right ]_{ ii}\left \Vert \boldsymbol{\mu }\right \Vert ^{2}\right ) {}\\ & & =\mathrm{ tr}\left (\boldsymbol{ZA}^{-2}\boldsymbol{Z}^{T}\right )\sup \limits _{\boldsymbol{\mu } \in \mathcal{L}}\left \Vert \boldsymbol{\mu }\right \Vert ^{2} =\mathrm{ tr}\left (\boldsymbol{WXA}^{-2}\boldsymbol{X}^{T}\right )\left (Mp^{\kappa }\left \Vert \boldsymbol{Y }\right \Vert \right )^{2} = o\left (p^{2}\right )\left \Vert \boldsymbol{Y }\right \Vert ^{2}, {}\\ \end{array}$$

where the last equality uses condition \(\left (\mathrm{G}\right )\). Thus,

$$\displaystyle{\mathbb{E}\left (\sup \limits _{\boldsymbol{\mu }\in \mathcal{L}}\sum _{i=1}^{k}\left [\boldsymbol{ZA}^{-1}\boldsymbol{\mu }\right ]_{ i}^{2}\right ) = o\left (p^{3}\right ).}$$

Also note that

$$\displaystyle\begin{array}{rcl} \mathbb{E}\left (\sum _{i=1}^{k}\left [\boldsymbol{Z}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\right ]_{ i}^{2}\right )& =& \mathbb{E}\left (\mathrm{tr}\left (\boldsymbol{Z}^{T}\boldsymbol{Z}\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )\left (\boldsymbol{Y }-\boldsymbol{\theta }\right )^{T}\right )\right ) {}\\ & =& \mathrm{tr}\left (\boldsymbol{Z}^{T}\boldsymbol{ZA}\right ) =\mathrm{ tr}\left (\boldsymbol{WXAX}^{T}\right ) = O\left (p\right ) {}\\ \end{array}$$

by condition \(\left (\mathrm{D}\right )\). Recall that \(d_{k} = O\left (p^{-1}\right )\) by condition \(\left (\mathrm{F}\right )\). It follows that

$$\displaystyle{\mathbb{E}\left (\sup \limits _{0\leq \lambda \leq \infty,\;\boldsymbol{\mu }\in \mathcal{L}}\left \vert \left (\mathrm{III}\right )_{2}\right \vert \right ) \leq \dfrac{2} {p}O\left (p^{-1}\right )o\left (p^{3/2}\right )O\left (p^{1/2}\right ) = o\left (1\right ),}$$

which completes our proof.

Proof of Lemma 2

The fact that \(\boldsymbol{\hat{\mu }}^{\mathrm{OLS}} \in \mathcal{L}\) is trivial as

$$\displaystyle{\boldsymbol{\hat{\mu }}^{\mathrm{OLS}} =\boldsymbol{ X}^{T}\left (\boldsymbol{XX}^{T}\right )^{-1}\boldsymbol{XY } =\boldsymbol{ P}_{\boldsymbol{ X}}\boldsymbol{Y },}$$

while the projection matrix \(\boldsymbol{P}_{\boldsymbol{X}}\) has induced matrix 2-norm \(\left \Vert \boldsymbol{P}_{\boldsymbol{X}}\right \Vert _{2} = 1\). Thus, \(\left \Vert \boldsymbol{\hat{\mu }}^{\mathrm{OLS}}\right \Vert \leq \left \Vert \boldsymbol{P}_{\boldsymbol{X}}\right \Vert _{2}\left \Vert \boldsymbol{Y }\right \Vert = \left \Vert \boldsymbol{Y }\right \Vert\). For \(\boldsymbol{\hat{\mu }}^{\mathrm{WLS}}\), note that

$$\displaystyle\begin{array}{rcl} \boldsymbol{\hat{\mu }}^{\mathrm{WLS}}& =& \boldsymbol{X}^{T}\left (\boldsymbol{XA}^{-1}\boldsymbol{X}^{T}\right )^{-1}\boldsymbol{XA}^{-1}\boldsymbol{Y } {}\\ & =& \boldsymbol{A}^{1/2}\left (\boldsymbol{XA}^{-1/2}\right )^{T}\left (\boldsymbol{XA}^{-1/2}\left (\boldsymbol{XA}^{-1/2}\right )^{T}\right )^{-1}\left (\boldsymbol{XA}^{-1/2}\right )\boldsymbol{A}^{-1/2}\boldsymbol{Y } {}\\ & =& \boldsymbol{A}^{1/2}\left (\boldsymbol{P}_{\boldsymbol{ XA}^{-1/2}}\right )\boldsymbol{A}^{-1/2}\boldsymbol{Y }, {}\\ \end{array}$$

where \(\boldsymbol{P}_{\boldsymbol{XA}^{-1/2}}\) is the ordinary projection matrix onto the row space of \(\boldsymbol{XA}^{-1/2}\) and has induced matrix 2-norm 1. It follows that

$$\displaystyle{\left \Vert \boldsymbol{\hat{\mu }}^{\mathrm{WLS}}\right \Vert \leq \left \Vert \boldsymbol{A}^{1/2}\right \Vert _{ 2}\left \Vert \boldsymbol{P}_{\boldsymbol{XA}^{-1/2}}\right \Vert _{2}\left \Vert \boldsymbol{A}^{-1/2}\right \Vert _{ 2}\left \Vert \boldsymbol{Y }\right \Vert =\max \limits _{1\leq i\leq p}A_{i}^{1/2} \times \max \limits _{ 1\leq i\leq p}A_{i}^{-1/2} \times \left \Vert \boldsymbol{Y }\right \Vert.}$$

Condition \(\left (\mathrm{A}\right )\) gives

$$\displaystyle{\max \limits _{1\leq i\leq p}A_{i}^{1/2} = (\max \limits _{ 1\leq i\leq p}A_{i}^{2})^{1/4} \leq (\sum _{ i=1}^{p}A_{ i}^{2})^{1/4} = O\left (p^{1/4}\right ).}$$

Similarly, condition \(\left (\mathrm{A}^{{\prime}}\right )\) gives

$$\displaystyle{\max \limits _{1\leq i\leq p}A_{i}^{-1/2} = (\max \limits _{ 1\leq i\leq p}A_{i}^{-2-\delta })^{1/\left (4+2\delta \right )} \leq (\sum _{ i=1}^{p}A_{ i}^{-2-\delta })^{1/\left (4+2\delta \right )} = O\left (p^{1/\left (4+2\delta \right )}\right ).}$$

We have thus proved that

$$\displaystyle{\left \Vert \boldsymbol{\hat{\mu }}^{\mathrm{WLS}}\right \Vert \leq O\left (p^{1/4}\right )O\left (p^{1/\left (4+2\delta \right )}\right )\left \Vert \boldsymbol{Y }\right \Vert = O\left (p^{\kappa }\right )\left \Vert \boldsymbol{Y }\right \Vert.}$$
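Both bounds in this lemma are easy to sanity-check numerically. Here is a minimal sketch of ours (dimensions and variances are illustrative) verifying \(\left \Vert \boldsymbol{\hat{\mu }}^{\mathrm{OLS}}\right \Vert \leq \left \Vert \boldsymbol{Y }\right \Vert\) and the factorized form \(\boldsymbol{A}^{1/2}\boldsymbol{P}_{\boldsymbol{XA}^{-1/2}}\boldsymbol{A}^{-1/2}\boldsymbol{Y }\) of \(\boldsymbol{\hat{\mu }}^{\mathrm{WLS}}\) used above:

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 50, 4

X = rng.standard_normal((k, p))
A = np.diag(rng.uniform(0.5, 2.0, size=p))
Y = rng.standard_normal(p)

# OLS fit: projecting Y onto the row space of X never increases the norm.
mu_ols = X.T @ np.linalg.solve(X @ X.T, X @ Y)
assert np.linalg.norm(mu_ols) <= np.linalg.norm(Y) * (1 + 1e-12)

# WLS fit and its factorized form A^{1/2} P A^{-1/2} Y, where P projects
# onto the row space of X A^{-1/2}.
Ainv = np.linalg.inv(A)
mu_wls = X.T @ np.linalg.solve(X @ Ainv @ X.T, X @ Ainv @ Y)
A_half = np.sqrt(A)                  # entrywise sqrt is valid since A is diagonal
XAh = X @ np.linalg.inv(A_half)
P = XAh.T @ np.linalg.solve(XAh @ XAh.T, XAh)
assert np.allclose(mu_wls, A_half @ P @ np.linalg.inv(A_half) @ Y)
```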

Proof of Theorem 2

To prove the first assertion, note that

$$\displaystyle{\mathrm{URE}\left (\boldsymbol{\hat{B} }^{\mathrm{URE}},\boldsymbol{\hat{\mu }}^{\mathrm{URE}}\right ) \leq \mathrm{ URE}\left (\boldsymbol{\tilde{B}}^{\mathrm{OL}},\boldsymbol{\tilde{\mu }}^{\mathrm{OL}}\right )}$$

by the definition of \(\boldsymbol{\hat{B} }^{\mathrm{URE}}\) and \(\boldsymbol{\hat{\mu }}^{\mathrm{URE}}\), so Theorem 1 implies that

$$\displaystyle\begin{array}{rcl} & & l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\mathrm{URE}}\right ) - l_{ p}\left (\boldsymbol{\theta },\boldsymbol{\tilde{\theta }}^{\mathrm{OL}}\right ) \\ & & \leq l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\mathrm{URE}}\right ) -\mathrm{ URE}\left (\boldsymbol{\hat{B} }^{\mathrm{URE}},\boldsymbol{\hat{\mu }}^{\mathrm{URE}}\right ) +\mathrm{ URE}\left (\boldsymbol{\tilde{B}}^{\mathrm{OL}},\boldsymbol{\tilde{\mu }}^{\mathrm{OL}}\right ) - l_{ p}\left (\boldsymbol{\theta },\boldsymbol{\tilde{\theta }}^{\mathrm{OL}}\right ) \\ & & \leq 2\sup \limits _{\boldsymbol{B}\in \mathcal{B},\;\boldsymbol{\mu }\in \mathcal{L}}\left \vert \mathrm{URE}\left (\boldsymbol{B},\boldsymbol{\mu }\right ) - l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{\mu }}\right )\right \vert \mathop{ \rightarrow }\limits_{ p \rightarrow \infty }0\text{ in }L^{1}\text{ and in probability,}{}\end{array}$$
(22)

where the second inequality uses the condition that \(\boldsymbol{\hat{\mu }}^{\mathrm{URE}} \in \mathcal{L}\). Thus, for any ε > 0,

$$\displaystyle\begin{array}{rcl} & & \mathbb{P}\left (l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\mathrm{URE}}\right ) \geq l_{ p}\left (\boldsymbol{\theta },\boldsymbol{\tilde{\theta }}^{\mathrm{OL}}\right )+\epsilon \right ) {}\\ & & \leq \mathbb{P}\left (2\sup \limits _{\boldsymbol{B}\in \mathcal{B},\;\boldsymbol{\mu }\in \mathcal{L}}\left \vert \mathrm{URE}\left (\boldsymbol{B},\boldsymbol{\mu }\right ) - l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{\mu }}\right )\right \vert \geq \epsilon \right ) \rightarrow 0. {}\\ \end{array}$$

To prove the second assertion, note that

$$\displaystyle{l_{p}\left (\boldsymbol{\theta },\boldsymbol{\tilde{\theta }}^{\mathrm{OL}}\right ) \leq l_{ p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\mathrm{URE}}\right )}$$

by the definition of \(\boldsymbol{\tilde{\theta }}^{\mathrm{OL}}\) and the condition \(\boldsymbol{\hat{\mu }}^{\mathrm{URE}} \in \mathcal{L}\). Thus, taking expectations in Eq. (22) easily gives the second assertion.
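The theorem's conclusion, that minimizing the URE is asymptotically as good as the oracle choice, can also be seen in a small simulation. The sketch below is ours and restricts Model I to \(\boldsymbol{\mu } =\boldsymbol{ 0}\) and a finite grid of \(\lambda\) values (the oracle in the theorem also optimizes over \(\boldsymbol{\mu }\in \mathcal{L}\)); the regret of the URE-selected \(\hat{\lambda }\) typically shrinks as \(p\) grows:

```python
import numpy as np

rng = np.random.default_rng(4)
lam_grid = np.geomspace(1e-3, 1e3, 200)

def regret(p):
    """Loss of the URE-selected lambda minus the oracle loss over the grid."""
    A = rng.uniform(0.5, 2.0, size=p)
    theta = rng.standard_normal(p)
    Y = theta + rng.standard_normal(p) * np.sqrt(A)
    s = A[None, :] / (A[None, :] + lam_grid[:, None])     # one row per lambda
    est = (1 - s) * Y[None, :]
    ure = ((s * Y[None, :]) ** 2).sum(axis=1) / p + (A - 2 * s * A).sum(axis=1) / p
    loss = ((est - theta[None, :]) ** 2).sum(axis=1) / p
    return loss[np.argmin(ure)] - loss.min()

for p in (100, 1000, 10000):
    print(p, regret(p))
```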

Proof of Corollary 1

Simply note that

$$\displaystyle{l_{p}\left (\boldsymbol{\theta },\boldsymbol{\tilde{\theta }}^{\mathrm{OL}}\right ) \leq l_{ p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{\hat{B} }_{p},\boldsymbol{\hat{\mu }}_{p} }\right )}$$

by the definition of \(\boldsymbol{\tilde{\theta }}^{\mathrm{OL}}\). Thus,

$$\displaystyle{l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\mathrm{URE}}\right ) - l_{ p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{\hat{B} }_{p},\boldsymbol{\hat{\mu }}_{p} }\right ) \leq l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\mathrm{URE}}\right ) - l_{ p}\left (\boldsymbol{\theta },\boldsymbol{\tilde{\theta }}^{\mathrm{OL}}\right ).}$$

Then Theorem 2 clearly implies the desired result.

Proof of Theorem 3

We observe that

$$\displaystyle\begin{array}{rcl} \mathrm{URE}_{\boldsymbol{M}}\left (\boldsymbol{B}\right ) - l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{\hat{\mu }}^{\boldsymbol{M}} }\right )& =& \mathrm{URE}\left (\boldsymbol{B},\boldsymbol{\hat{\mu }}^{\boldsymbol{M}}\right ) - l_{ p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{\hat{\mu }}^{\boldsymbol{M}} }\right ) {}\\ & & +\dfrac{2} {p}\mathrm{tr}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{P}_{\boldsymbol{ M},\boldsymbol{X}}\boldsymbol{A}\right ). {}\\ \end{array}$$

Since

$$\displaystyle\begin{array}{rcl} \sup \limits _{\boldsymbol{B}\in \mathcal{B}}\left \vert \mathrm{URE}\left (\boldsymbol{B},\boldsymbol{\hat{\mu }}^{\boldsymbol{M}}\right ) - l_{ p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{\hat{\mu }}^{\boldsymbol{M}} }\right )\right \vert & \leq & \sup \limits _{\boldsymbol{B}\in \mathcal{B},\;\boldsymbol{\mu }\in \mathcal{L}}\left \vert \mathrm{URE}\left (\boldsymbol{B},\boldsymbol{\mu }\right ) - l_{p}\left (\boldsymbol{\theta },\boldsymbol{\hat{\theta }}^{\boldsymbol{B},\boldsymbol{\mu }}\right )\right \vert {}\\ & \rightarrow & 0\text{ in }L^{1} {}\\ \end{array}$$

by Theorem 1, we only need to show that

$$\displaystyle{\sup \limits _{\boldsymbol{B}\in \mathcal{B}}\left \vert \dfrac{1} {p}\mathrm{tr}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{P}_{\boldsymbol{ M},\boldsymbol{X}}\boldsymbol{A}\right )\right \vert \rightarrow 0\ \ \ \ \text{as }p \rightarrow \infty.}$$

Under Model I,

$$\displaystyle\begin{array}{rcl} \mathrm{tr}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{P}_{\boldsymbol{ M},\boldsymbol{X}}\boldsymbol{A}\right )& =& \sum _{i=1}^{p} \frac{A_{i}} {A_{i}+\lambda }[\boldsymbol{P}_{\boldsymbol{M},\boldsymbol{X}}\boldsymbol{A}]_{ii} {}\\ & \leq & \left (\sum _{i=1}^{p}( \frac{A_{i}} {A_{i}+\lambda })^{2} \times \sum _{ i=1}^{p}[\boldsymbol{P}_{\boldsymbol{ M},\boldsymbol{X}}\boldsymbol{A}]_{ii}^{2}\right )^{1/2} {}\\ & \leq & \left (p \times \sum _{i=1}^{p}\left [\boldsymbol{P}_{\boldsymbol{ M},\boldsymbol{X}}\boldsymbol{A}\right ]_{ii}^{2}\right )^{1/2} {}\\ & =& p^{1/2}\sqrt{\mathrm{tr }\left (\boldsymbol{P}_{\boldsymbol{ M},\boldsymbol{X}}\boldsymbol{A}(\boldsymbol{P}_{\boldsymbol{M},\boldsymbol{X}}\boldsymbol{A})^{T}\right )},\ \ \ \ \text{for all }\lambda \geq 0, {}\\ \end{array}$$

but \(\mathrm{tr}\left (\boldsymbol{P}_{\boldsymbol{M},\boldsymbol{X}}\boldsymbol{AAP}_{\boldsymbol{M},\boldsymbol{X}}^{T}\right ) =\mathrm{ tr}\left (\boldsymbol{X}^{T}\left (\boldsymbol{XMX}^{T}\right )^{-1}\boldsymbol{XMA}^{2}\boldsymbol{MX}^{T}\left (\boldsymbol{XMX}^{T}\right )^{-1}\boldsymbol{X}\right ) =\mathrm{ tr}\left (\left (\boldsymbol{XMX}^{T}\right )^{-1}(\boldsymbol{XMA}^{2}\boldsymbol{MX}^{T})\left (\boldsymbol{XMX}^{T}\right )^{-1}(\boldsymbol{XX}^{T})\right ) = O(1)\) by (13) and condition (E). Therefore,

$$\displaystyle{\sup \limits _{\boldsymbol{B}\in \mathcal{B}}\left \vert \dfrac{1} {p}\mathrm{tr}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{P}_{\boldsymbol{ M},\boldsymbol{X}}\boldsymbol{A}\right )\right \vert = \dfrac{1} {p}O\left (p^{1/2}\right )O(1) = O(\,p^{-1/2}) \rightarrow 0.}$$

Under Model II, \(\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1} =\boldsymbol{ I}_{p} -\lambda \boldsymbol{ Z}^{T}\left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\), where \(\boldsymbol{W}^{-1/2}\boldsymbol{V W}^{-1/2} =\boldsymbol{ U\varLambda }\boldsymbol{U}^{T}\), \(\boldsymbol{\varLambda }=\mathrm{ diag}\left (d_{1},\ldots,d_{k}\right )\) with \(d_{1} \leq \cdots \leq d_{k}\), and \(\boldsymbol{Z} =\boldsymbol{ U}^{T}\boldsymbol{W}^{1/2}\boldsymbol{X}\), as defined in the proof of Theorem 1. Thus,

$$\displaystyle{\mathrm{tr}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{P}_{\boldsymbol{ M},\boldsymbol{X}}\boldsymbol{A}\right ) =\mathrm{ tr}\left (\boldsymbol{P}_{\boldsymbol{M},\boldsymbol{X}}\boldsymbol{A}\right ) -\mathrm{ tr}\left (\lambda \boldsymbol{Z}^{T}\left (\lambda \boldsymbol{I}_{ k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\boldsymbol{P}_{\boldsymbol{ M},\boldsymbol{X}}\boldsymbol{A}\right ).}$$

We know that \(\mathrm{tr}\left (\boldsymbol{P}_{\boldsymbol{M},\boldsymbol{X}}\boldsymbol{A}\right ) =\mathrm{ tr}\left (\left (\boldsymbol{XMX}^{T}\right )^{-1}(\boldsymbol{XMAX}^{T})\right ) = O(1)\) by the assumption (13). For the second term, \(\mathrm{tr}\left (\lambda \boldsymbol{Z}^{T}\left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\boldsymbol{P}_{\boldsymbol{M},\boldsymbol{X}}\boldsymbol{A}\right ) =\mathrm{ tr}\left (\lambda \left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\boldsymbol{P}_{\boldsymbol{M},\boldsymbol{X}}\boldsymbol{AZ}^{T}\right ) =\mathrm{ tr}\left (\lambda \left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\boldsymbol{Z}\boldsymbol{A}^{-1}\boldsymbol{X}^{T}\left (\boldsymbol{XMX}^{T}\right )^{-1}\boldsymbol{XMAZ}^{T}\right )\). The Cauchy-Schwarz inequality for the matrix trace gives

$$\displaystyle\begin{array}{rcl} & & \left \vert \mathrm{tr}\left (\left (\lambda \left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda }\right )\left (\boldsymbol{ZA}^{-1}\boldsymbol{X}^{T}\left (\boldsymbol{XMX}^{T}\right )^{-1}\boldsymbol{XMAZ}^{T}\right )\right )\right \vert {}\\ & & \leq \mathrm{ tr}^{1/2}\left ((\lambda \left (\lambda \boldsymbol{I}_{ k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda })^{2}\right ) {}\\ & & \quad \times \mathrm{ tr}^{1/2}\left (\boldsymbol{ZA}^{-1}\boldsymbol{X}^{T}\left (\boldsymbol{XMX}^{T}\right )^{-1}\boldsymbol{XMAZ}^{T}\boldsymbol{ZAMX}^{T}\left (\boldsymbol{XMX}^{T}\right )^{-1}\boldsymbol{XA}^{-1}\boldsymbol{Z}^{T}\right ). {}\\ \end{array}$$

Since

$$\displaystyle\begin{array}{rcl} \mathrm{tr}\left ((\lambda \left (\lambda \boldsymbol{I}_{k}+\boldsymbol{\varLambda }\right )^{-1}\boldsymbol{\varLambda })^{2}\right ) =\sum _{ i=1}^{k}\left ( \dfrac{\lambda d_{i}} {\lambda +d_{i}}\right )^{2} \leq kd_{ k}^{2} = O\left (p^{-2}\right )\ \ \ \ \text{for all }\lambda \geq 0& & {}\\ \end{array}$$

as shown in the proof of Theorem 1 and

$$\displaystyle\begin{array}{rcl} & & \mathrm{tr}\left (\boldsymbol{ZA}^{-1}\boldsymbol{X}^{T}\left (\boldsymbol{XMX}^{T}\right )^{-1}\boldsymbol{XMAZ}^{T}\boldsymbol{ZAMX}^{T}\left (\boldsymbol{XMX}^{T}\right )^{-1}\boldsymbol{XA}^{-1}\boldsymbol{Z}^{T}\right ) {}\\ & & =\mathrm{ tr}\left (\left (\boldsymbol{XMX}^{T}\right )^{-1}\boldsymbol{XMAZ}^{T}\boldsymbol{ZAMX}^{T}\left (\boldsymbol{XMX}^{T}\right )^{-1}\boldsymbol{XA}^{-1}\boldsymbol{Z}^{T}\boldsymbol{ZA}^{-1}\boldsymbol{X}^{T}\right ) {}\\ & & =\mathrm{ tr}\left (\left (\boldsymbol{XMX}^{T}\right )^{-1}(\boldsymbol{XMAX}^{T})\boldsymbol{W}(\boldsymbol{XAMX}^{T})\left (\boldsymbol{XMX}^{T}\right )^{-1}(\boldsymbol{XA}^{-1}\boldsymbol{X}^{T})\boldsymbol{W}(\boldsymbol{XA}^{-1}\boldsymbol{X}^{T})\right ) {}\\ & & = O(\,p^{2}) {}\\ \end{array}$$

from (13) and condition (F), we have

$$\displaystyle{\sup \limits _{\boldsymbol{B}\in \mathcal{B}}\left \vert \dfrac{1} {p}\mathrm{tr}\left (\boldsymbol{A}\left (\boldsymbol{A} +\boldsymbol{ B}\right )^{-1}\boldsymbol{P}_{\boldsymbol{ M},\boldsymbol{X}}\boldsymbol{A}\right )\right \vert = \dfrac{1} {p}\left (O(1) + \sqrt{O\left (p^{-2 } \right ) \times O(\,p^{2 } )}\right ) = O(\,p^{-1}) \rightarrow 0.}$$

This completes our proof of (14). With this established, the rest of the proof is identical to that of Theorem 2 and Corollary 1.


Copyright information

© 2017 Springer International Publishing AG

Cite this chapter

Kou, S.C., Yang, J.J. (2017). Optimal Shrinkage Estimation in Heteroscedastic Hierarchical Linear Models. In: Ahmed, S.E. (ed.) Big and Complex Data Analysis. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-41573-4_13
