Asymptotic bias of the \(\ell _2\)-regularized error variance estimator

  • Research Article
  • Published in the Journal of the Korean Statistical Society

Abstract

This study considers the \(\ell _2\)-regularized error variance estimator, an effective tool for non-sparse linear models. Specifically, it investigates previously unexplored theoretical properties of this estimator, particularly its asymptotic bias in high-dimensional settings where the number of variables grows with the number of observations. We prove that the estimator is asymptotically unbiased in large-sample settings (\(\lim p/n <1\)), while it is asymptotically biased in general ultra-high-dimensional settings (\(\lim p/n >1\)). In the ultra-high-dimensional regime, the asymptotic bias of the \(\ell _2\)-regularized estimator is derived exactly when the covariates are independent.

References

  • Bai, Z. D., & Yin, Y. Q. (1993). Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3), 1275–1294. https://doi.org/10.1214/aop/1176989118

  • Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika, 101(2), 269–284.

  • Dicker, L. H., & Erdogdu, M. A. (2016). Maximum likelihood for variance estimation in high-dimensional linear models. In Artificial intelligence and statistics, PMLR (pp. 159–167).

  • Dobriban, E., & Wager, S. (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1), 247–279.

  • Fan, J., Guo, S., & Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1), 37–65.

  • Greenshtein, E., & Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6), 971–988.

  • Janson, L., Barber, R. F., & Candes, E. (2017). Eigenprism: Inference for high dimensional signal-to-noise ratios. Journal of the Royal Statistical Society Series B, Statistical Methodology, 79(4), 1037.

  • Javanmard, A., & Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(82), 2869–2909.

  • Liu, X., Zheng, S., & Feng, X. (2020). Estimation of error variance via ridge regression. Biometrika, 107(2), 481–488.

  • Marchenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 114(4), 507–536.

  • Ning, Y., & Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1), 158–195. https://doi.org/10.1214/16-AOS1448

  • Park, G., Moon, S. J., Park, S., et al. (2021). Learning a high-dimensional linear structural equation model via \(\ell _1\)-regularized regression. The Journal of Machine Learning Research, 22(1), 4607–4647.

  • Reid, S., Tibshirani, R., & Friedman, J. (2016). A study of error variance estimation in lasso regression. Statistica Sinica, 26(1), 35–67.

  • Rubio, F., & Mestre, X. (2011). Spectral convergence for a general class of random matrices. Statistics and Probability Letters, 81(5), 592–602.

  • Silverstein, J. W. (1995). Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices. Journal of Multivariate Analysis, 55(2), 331–339.

  • Städler, N., Bühlmann, P., & Van De Geer, S. (2010). \(\ell _1\)-penalization for mixture regression models. Test, 19, 209–256.

  • Sun, T., & Zhang, C. H. (2012). Scaled sparse linear regression. Biometrika, 99(4), 879–898.

  • Sun, T., & Zhang, C. H. (2013). Sparse matrix inversion with scaled lasso. The Journal of Machine Learning Research, 14(1), 3385–3418.

  • van de Geer, S., Bühlmann, P., Ritov, Y., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202. https://doi.org/10.1214/14-AOS1221

  • Verzelen, N., & Gassiat, E. (2018). Adaptive estimation of high-dimensional signal-to-noise ratios. Bernoulli, 24(4B), 3683–3710. https://doi.org/10.3150/17-BEJ975

  • Wang, X., Kong, L., & Wang, L. (2022). Estimation of error variance in regularized regression models via adaptive lasso. Mathematics, 10(11), 1937.

  • Yu, G., & Bien, J. (2019). Estimating the error variance in a high-dimensional linear model. Biometrika, 106(3), 533–546.

  • Zhang, C. H., & Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society Series B (Statistical Methodology), 76(1), 217–242.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (No. NRF-2021R1C1C2006380) for Semin Choi, and by the National Research Foundation of Korea (NRF) Grants funded by the Korea government (MSIT) (NRF-2021R1C1C1004562 and RS-2023-00218231) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) Grant funded by the Korea government (MSIT) [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)] for Gunwoong Park.

Author information

Corresponding author

Correspondence to Gunwoong Park.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Appendices

Appendix 1: Proof of Theorem 3

Proof

We begin by proving the first theoretical result for general \(\Sigma\). For ease of notation, we write \(\lambda = \lambda _{n}\). Recall that the bias of the estimator (6) is expressed as follows:

$$\begin{aligned} \frac{{\textrm{E}}\left( {\hat{\sigma }}_\lambda ^2\right) - \sigma ^2}{\Vert {{\varvec{\beta }}}\Vert _2^2} = \frac{\lambda }{1-n^{-1}{\text{ tr }}\left( H_n^{\lambda _n} \right) } \cdot \frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2} {{\varvec{\beta }}}^T S_n (S_n+\lambda I_p)^{-1} {{\varvec{\beta }}}. \end{aligned}$$
(8)

Simple algebra yields that the denominator of the first term in Eq. (8) is as follows:

$$\begin{aligned} 1- \frac{{\text {tr}}(H_n^{\lambda _n})}{n} = 1- \frac{{\text {tr}}\left( \frac{1}{n}X(S_n + \lambda I_p)^{-1}X^T\right) }{n} = 1-\frac{p}{n} + \frac{p}{n} \cdot \frac{\lambda }{p} {\text {tr}}\bigg (\Big (S_n + \lambda I_p\Big )^{-1}\bigg ). \end{aligned}$$
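As a quick numerical sanity check of this identity (not part of the proof), one can compare both sides on a randomly generated Gaussian design; the values of n, p, and \(\lambda\) below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 120, 0.7  # illustrative sizes with p > n
X = rng.standard_normal((n, p))
S = X.T @ X / n  # S_n = X^T X / n (p x p sample covariance)
H = X @ np.linalg.inv(S + lam * np.eye(p)) @ X.T / n  # ridge hat matrix H_n^lambda

lhs = 1 - np.trace(H) / n
rhs = 1 - p / n + (p / n) * (lam / p) * np.trace(np.linalg.inv(S + lam * np.eye(p)))
print(np.isclose(lhs, rhs))  # True
```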

Meanwhile, for any fixed \(\lambda _0 \in {\mathbb {R}}^+\), Equations (2) and (3) in Dobriban et al. (2018) imply that

$$\begin{aligned} \lim _{n \rightarrow \infty } \frac{\lambda _0}{p} {\text{ tr }}\bigg (\Big (S_n + \lambda _0 I_p\Big )^{-1}\bigg ) = \frac{1}{\tau } \lambda _0 v(-\lambda _0) - \frac{1}{\tau } + 1 \qquad a.s. \end{aligned}$$
(9)

For ease of notation, let \(f_n(\lambda ) := \frac{1}{\lambda }\left\{ \frac{n}{p} - 1 + \frac{\lambda }{p} {\text{ tr }}\left( \left( S_n + \lambda I_p\right) ^{-1}\right) \right\}\). Then \(\lim _{n\rightarrow \infty } f_n(\lambda _0) = \frac{v(-\lambda _0)}{\tau }\) for each \(\lambda _0 \in {\mathbb {R}}^+\). For the existence of \(\lim _{n\rightarrow \infty } f_n(\lambda _n)\), it suffices, by the Moore–Osgood theorem, to show that the sequence \(f_n\), \(n=1,2,\ldots\), converges uniformly on \({\mathbb {R}}^+\).

Since the nonzero eigenvalues of \(S_n\) and \(\frac{1}{n}XX^T\) coincide, while \(S_n\) has 0 as an additional eigenvalue with multiplicity \((p-n)\), we have

$$\begin{aligned} {\text{ tr }}\left( \left( S_n + \lambda I_p\right) ^{-1}\right) = \frac{p-n}{\lambda } + {\text{ tr }}\left( \left( \frac{1}{n}XX^T + \lambda I_n\right) ^{-1}\right) . \end{aligned}$$
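This trace identity is easy to confirm numerically (a sketch with arbitrary illustrative sizes; not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 100, 0.5  # p > n, so S_n has p - n zero eigenvalues
X = rng.standard_normal((n, p))
S = X.T @ X / n   # p x p, rank n almost surely
G = X @ X.T / n   # n x n, shares the nonzero eigenvalues of S

lhs = np.trace(np.linalg.inv(S + lam * np.eye(p)))
rhs = (p - n) / lam + np.trace(np.linalg.inv(G + lam * np.eye(n)))
print(np.isclose(lhs, rhs))  # True
```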

Simple algebra yields that

$$\begin{aligned} f_n(\lambda ) = \frac{1}{\lambda }\left\{ \frac{n}{p} - 1 + \frac{\lambda }{p} {\text{ tr }}\left( \left( S_n + \lambda I_p\right) ^{-1}\right) \right\} = \frac{1}{p}{\text{ tr }}\left( \left( \frac{1}{n}XX^T + \lambda I_n\right) ^{-1}\right) . \end{aligned}$$

Hence, we have

$$\begin{aligned} |f_n(\lambda _0)|\le \left|\frac{n}{p\cdot \Lambda _{\min }^+(S_n)}\right|\le \left|\frac{n}{p\cdot \Lambda _{\min }^+ (Z^TZ/n) \Lambda _{\min }(\Sigma )}\right|\le \frac{1}{2\tau }\cdot \frac{2}{\rho _{\min }(1-\sqrt{\tau })^2} \end{aligned}$$

for any fixed \(\lambda _0 \in {\mathbb {R}}^+\) and all sufficiently large n, where \(\Lambda _{\min }^+(\cdot )\) denotes the smallest nonzero eigenvalue. Here, the last inequality follows from the fact that \(\lim _{n\rightarrow \infty }\Lambda _{\min }^+ (Z^TZ/n) = (1-\sqrt{\tau })^2\) by Theorem 2 in Bai and Yin (1993).

Similarly, for the derivative \(f_n'(\lambda )=\partial f_n(\lambda )/\partial \lambda\), it is also observed that

$$\begin{aligned} |f_n'(\lambda _0) |=\left|\frac{1}{p}{\text{ tr }}\left( \left( \frac{1}{n}XX^T + \lambda _0 I_n\right) ^{-2}\right) \right|\le \left|\frac{n}{p\cdot (\Lambda _{\min }^+ (S_n))^2}\right|\le \frac{1}{2\tau }\cdot \frac{4}{\rho _{\min }^2(1-\sqrt{\tau })^4}, \end{aligned}$$

for any fixed \(\lambda _0 \in {\mathbb {R}}^+\) and sufficiently large n.

Hence, \(f_n\) and its derivative are uniformly bounded, and thus the sequence \(f_n\), \(n=1,2,\ldots\) is uniformly convergent by the Arzelà–Ascoli theorem. Consequently, we conclude that \(\lim _{n\rightarrow \infty } f_n(\lambda _n)\) exists. In addition, according to Lemma 2.3 in Dobriban et al. (2018), \(\lim _{n\rightarrow \infty } v(-\lambda _n) = v(-c)\) for \(\lim \lambda _n = c\), and hence, we have

$$\begin{aligned} \lim _{n \rightarrow \infty }\frac{1}{\lambda _n}\bigg \{\frac{n}{p} - 1 + \frac{\lambda _n}{p} {\text{ tr }}\bigg (\Big (S_n + \lambda _n I_p\Big )^{-1}\bigg )\bigg \} = \frac{v(-c)}{\tau } \qquad a.s. \end{aligned}$$

for all \(c \in [0,\infty )\).

Multiplying both sides by \(\tau\), the limit of the first term of the bias is as follows:

$$\begin{aligned} \lim _{n\rightarrow \infty } \frac{\lambda _n}{1-n^{-1}{\text{ tr }}\left( H_n^{\lambda _n}\right) } = \frac{1}{v(-c)} \qquad a.s. \end{aligned}$$
(10)

We now focus on the second part of the bias, \(\frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2} {{\varvec{\beta }}}^T S_n (S_n+\lambda _n I_p)^{-1} {{\varvec{\beta }}}\). As shown above, \(S_n\) has 0 as an eigenvalue with multiplicity \((p-n)\) and \(\Lambda _{\min }^+(S_n) \ge (1-\sqrt{\tau })^2\rho _{\min }/2\) for all sufficiently large n. Define \(\{e_1,\ldots ,e_{n}\}\) as the eigenvectors of \(S_n\) corresponding to the nonzero eigenvalues and \(\{e_{n+1},\ldots ,e_{p}\}\) as the eigenvectors of \(S_n\) corresponding to the eigenvalue 0. Then, \({{\varvec{\beta }}}\) is decomposed as \({{\varvec{\beta }}}= {{\varvec{\beta }}}_{E_1} + {{\varvec{\beta }}}_{E_2}\), where \(E_1\) and \(E_2\) are the subspaces of \({\mathbb {R}}^p\) spanned by \(\{e_1,\ldots ,e_{n}\}\) and \(\{e_{n+1},\ldots ,e_{p}\}\), respectively. Hence, we have

$$\begin{aligned} \frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^T S_n (S_n+\lambda _n I_p)^{-1} {{\varvec{\beta }}}&= \frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}_{E_1}^T S_n (S_n+\lambda _n I_p)^{-1} {{\varvec{\beta }}}_{E_1} \\&\ge \frac{\Vert {{\varvec{\beta }}}_{E_1}\Vert _2^2}{\Vert {{\varvec{\beta }}}\Vert _2^2} \cdot \frac{(1-\sqrt{\tau })^2\rho _{\min }/2}{\lambda _n+(1-\sqrt{\tau })^2\rho _{\min }/2} \qquad a.s. \end{aligned}$$

for all sufficiently large n.
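The projection bound above can also be checked numerically (a sketch under a Gaussian design with illustrative sizes; the smallest nonzero eigenvalue is computed directly rather than via the Bai–Yin bound):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 60, 150, 1.0
X = rng.standard_normal((n, p))
S = X.T @ X / n
beta = rng.standard_normal(p)

evals, evecs = np.linalg.eigh(S)
nz = evals > 1e-8                                  # nonzero eigenvalues (spanning E_1)
beta_E1 = evecs[:, nz] @ (evecs[:, nz].T @ beta)   # projection of beta onto E_1
lam_min_plus = evals[nz].min()                     # smallest nonzero eigenvalue of S_n

quad = beta @ S @ np.linalg.inv(S + lam * np.eye(p)) @ beta / (beta @ beta)
bound = (beta_E1 @ beta_E1) / (beta @ beta) * lam_min_plus / (lam + lam_min_plus)
print(quad >= bound)  # True
```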

Applying Eqs. (8) and (10), we obtain

$$\begin{aligned} \liminf _{n\rightarrow \infty } \frac{\left[ {\textrm{E}}\left( {\hat{\sigma }}_{\lambda _n}^2\right) - \sigma ^2\right] }{\Vert {{\varvec{\beta }}}\Vert _2^2} \ge \frac{1}{v(-c)} \cdot \frac{(1-\sqrt{\tau })^2\rho _{\min }/2}{c+(1-\sqrt{\tau })^2\rho _{\min }/2} \cdot \liminf _{n\rightarrow \infty } \frac{\Vert {{\varvec{\beta }}}_{E_1}\Vert _2^2}{\Vert {{\varvec{\beta }}}\Vert _2^2} \qquad a.s. \end{aligned}$$

Since \(E_1\) is determined by the design matrix \(X = Z\Sigma ^{1/2}\) under the setting where Z is randomly generated and \(\Sigma\) has full rank, it holds that \(\liminf _{n\rightarrow \infty } \frac{\Vert {{\varvec{\beta }}}_{E_1}\Vert _2^2}{\Vert {{\varvec{\beta }}}\Vert _2^2} > 0\) almost surely. This completes the proof of the theoretical result for general \(\Sigma\). \(\square\)

Appendix 2: Proof of Theorem 4

Proof

We now prove the second theoretical result with \(\Sigma = I_p\). Equation (10) also holds for \(\Sigma =I_p\), and hence it suffices to focus on the term \(\frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2} {{\varvec{\beta }}}^T S_n (S_n+\lambda _n I_p)^{-1} {{\varvec{\beta }}}= 1 - \frac{\lambda _n}{\Vert {{\varvec{\beta }}}\Vert _2^2} {{\varvec{\beta }}}^T (S_n+\lambda _n I_p)^{-1} {{\varvec{\beta }}}\). First, we apply Theorem 1 in Rubio and Mestre (2011) with \(\Theta = I_p/p\), which implies

$$\begin{aligned} \lim _{n \rightarrow \infty } \left[ \frac{1}{p}{\text{ tr }}\left( (S_n+\lambda _0 I_p)^{-1} \right) - \frac{1}{p}{\text{ tr }}\left( (x_n(\lambda _0) I_p)^{-1}\right) \right] = 0 \qquad a.s., \end{aligned}$$

for any fixed \(\lambda _0 \in {\mathbb {R}}^+\), where \(x_n(\lambda _0)>0\) is a deterministic sequence satisfying a certain fixed-point equation for each n.

By Eq. (9), it follows that

$$\begin{aligned} \lim _{n\rightarrow \infty } \lambda _0\left( x_n(\lambda _0)\right) ^{-1} = \lim _{n\rightarrow \infty }\frac{\lambda _0}{p}{\text{ tr }}\left( (S_n + \lambda _0 I_p)^{-1}\right) = \frac{1}{\tau } \lambda _0 v(-\lambda _0) - \frac{1}{\tau } + 1, \qquad {a.s.} \end{aligned}$$

for any fixed \(\lambda _0 \in {\mathbb {R}}^+\), and hence we can define \(x(\lambda _0): = \lim _{n\rightarrow \infty } x_n(\lambda _0)\).

Furthermore, by Theorem 1 in Rubio and Mestre (2011) with \(\Theta = {{\varvec{\beta }}}{{\varvec{\beta }}}^T/\Vert {{\varvec{\beta }}}\Vert _2^2\),

$$\begin{aligned} \lim _{n\rightarrow \infty } \frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^T(S_n+\lambda _0 I_p)^{-1}{{\varvec{\beta }}}- \frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^T\left( x_n(\lambda _0) I_p\right) ^{-1}{{\varvec{\beta }}}= 0 \qquad a.s. \end{aligned}$$

for any fixed \(\lambda _0 \in {\mathbb {R}}^+\), and it implies that

$$\begin{aligned} \lim _{n\rightarrow \infty } \frac{\lambda _0}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^T(S_n+\lambda _0 I_p)^{-1}{{\varvec{\beta }}}= \lambda _0 x(\lambda _0)^{-1} \qquad a.s. \end{aligned}$$

for any fixed \(\lambda _0 \in {\mathbb {R}}^+\). Following the same argument as for Eq. (10), the existence of the limit of \(g_n(\lambda _n) := \frac{\lambda _n}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^T(S_n+\lambda _n I_p)^{-1}{{\varvec{\beta }}}=1-\frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^TS_n(S_n+\lambda _n I_p)^{-1}{{\varvec{\beta }}}\) is confirmed by the following inequalities:

$$\begin{aligned} |g_n(\lambda _0)|&\le \left|\lambda _0 \cdot \Lambda _{\max } \left( (S_n+\lambda _0 I_p)^{-1}\right) \right|\le 1,\\ |g_n'(\lambda _0)|&= \left|\frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^T S_n(S_n+\lambda _0 I_p)^{-2} {{\varvec{\beta }}}\right|\le \left|\Lambda _{\max } \left( S_n(S_n+\lambda _0 I_p)^{-2}\right) \right|\\&\le \frac{2}{\rho _{\min }(1-\sqrt{\tau })^2}, \end{aligned}$$

for any fixed \(\lambda _0 \in {\mathbb {R}}^+\) and all sufficiently large n.

Then, it holds that

$$\begin{aligned} \lim _{n\rightarrow \infty } \frac{\lambda _n }{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^T(S_n+\lambda _n I_p)^{-1}{{\varvec{\beta }}}&= \lim _{n\rightarrow \infty } \lambda _n x(\lambda _n)^{-1} \\&= \lim _{n\rightarrow \infty } \left[ \frac{1}{\tau } \lambda _n v(-\lambda _n) - \frac{1}{\tau } + 1 \right] \\&= \frac{cv(-c)}{\tau }-\frac{1}{\tau } + 1 \qquad a.s. \end{aligned}$$

Hence, we conclude that

$$\begin{aligned} \frac{\left[ {\textrm{E}}\left( {\hat{\sigma }}_{\lambda }^2\right) - \sigma ^2 \right] }{\Vert {{\varvec{\beta }}}\Vert _2^2}&= \frac{\lambda }{1-n^{-1}{\text{ tr }}\left( H_n^{\lambda _n} \right) } \cdot \frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^T S_n (S_n+\lambda I_p)^{-1} {{\varvec{\beta }}}\\ {}&= \frac{\lambda }{1-n^{-1}{\text{ tr }}\left( H_n^{\lambda _n} \right) } \cdot \left[ \frac{\Vert {{\varvec{\beta }}}\Vert _2^2}{\Vert {{\varvec{\beta }}}\Vert _2^2} - \frac{\lambda }{\Vert {{\varvec{\beta }}}\Vert _2^2} {{\varvec{\beta }}}^T(S_n+\lambda I_p)^{-1}{{\varvec{\beta }}}\right] \\&\rightarrow \frac{1}{v(-c)} \cdot \left[ 1- \left( \frac{cv(-c)}{\tau }-\frac{1}{\tau } + 1\right) \right] = \frac{1}{\tau } \left( \frac{1}{v(-c)}-c\right) \qquad a.s. \end{aligned}$$

as \(n\rightarrow \infty\). \(\square\)
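As a numerical illustration of the deterministic-equivalent step borrowed from Rubio and Mestre (2011): for \(\Sigma = I_p\) and a \({{\varvec{\beta }}}\) drawn independently of X, the normalized quadratic form should be close to its trace counterpart at moderately large n and p. The sizes, \(\lambda\), and the loose tolerance below are illustrative choices (a sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 400, 800, 1.0           # tau = p/n = 2 (ultra-high dimensional regime)
X = rng.standard_normal((n, p))     # Sigma = I_p
S = X.T @ X / n
beta = rng.standard_normal(p)       # drawn independently of X

A = np.linalg.inv(S + lam * np.eye(p))
quad = lam * (beta @ A @ beta) / (beta @ beta)  # lam * beta^T (S + lam I)^{-1} beta / ||beta||^2
det_eq = lam * np.trace(A) / p                  # its deterministic equivalent
print(abs(quad - det_eq) < 0.1)  # True (concentration at large n, p)
```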

Cite this article

Choi, S., Park, G. Asymptotic bias of the \(\ell _2\)-regularized error variance estimator. J. Korean Stat. Soc. 53, 132–148 (2024). https://doi.org/10.1007/s42952-023-00239-y
