Abstract
This study considers the \(\ell _2\)-regularized error variance estimator, an effective tool for non-sparse linear models. It investigates previously unexplored theoretical properties of this estimator, in particular its asymptotic bias in high-dimensional settings where the number of variables grows with the number of observations. We prove that the estimator is asymptotically unbiased in large-sample settings (\(\lim p/n <1\)), while it is asymptotically biased in general ultra-high dimensional settings (\(\lim p/n >1\)). Moreover, the asymptotic bias of the \(\ell _2\)-regularized estimator in ultra-high dimensional settings is derived exactly when the covariates are independent.
References
Bai, Z. D., & Yin, Y. Q. (1993). Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3), 1275–1294. https://doi.org/10.1214/aop/1176989118
Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika, 101(2), 269–284.
Dicker, L. H., & Erdogdu, M. A. (2016). Maximum likelihood for variance estimation in high-dimensional linear models. In Artificial intelligence and statistics, PMLR (pp. 159–167).
Dobriban, E., Wager, S., et al. (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1), 247–279.
Fan, J., Guo, S., & Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1), 37–65.
Greenshtein, E., & Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of over parametrization. Bernoulli, 10(6), 971–988.
Janson, L., Barber, R. F., & Candes, E. (2017). Eigenprism: Inference for high dimensional signal-to-noise ratios. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4), 1037.
Javanmard, A., & Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(82), 2869–2909.
Liu, X., Zheng, S., & Feng, X. (2020). Estimation of error variance via ridge regression. Biometrika, 107(2), 481–488.
Marchenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 114(4), 507–536.
Ning, Y., & Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1), 158–195. https://doi.org/10.1214/16-AOS1448
Park, G., Moon, S. J., Park, S., et al. (2021). Learning a high-dimensional linear structural equation model via \(\ell _1\)-regularized regression. The Journal of Machine Learning Research, 22(1), 4607–4647.
Reid, S., Tibshirani, R., & Friedman, J. (2016). A study of error variance estimation in lasso regression. Statistica Sinica, 26(1), 35–67.
Rubio, F., & Mestre, X. (2011). Spectral convergence for a general class of random matrices. Statistics and Probability Letters, 81(5), 592–602.
Silverstein, J. W. (1995). Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices. Journal of Multivariate Analysis, 55(2), 331–339.
Städler, N., Bühlmann, P., & Van De Geer, S. (2010). \(\ell _1\)-penalization for mixture regression models. Test, 19, 209–256.
Sun, T., & Zhang, C. H. (2012). Scaled sparse linear regression. Biometrika, 99(4), 879–898.
Sun, T., & Zhang, C. H. (2013). Sparse matrix inversion with scaled lasso. The Journal of Machine Learning Research, 14(1), 3385–3418.
van de Geer, S., Bühlmann, P., Ritov, Y., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202. https://doi.org/10.1214/14-AOS1221
Verzelen, N., & Gassiat, E. (2018). Adaptive estimation of high-dimensional signal-to-noise ratios. Bernoulli, 24(4B), 3683–3710. https://doi.org/10.3150/17-BEJ975
Wang, X., Kong, L., & Wang, L. (2022). Estimation of error variance in regularized regression models via adaptive lasso. Mathematics, 10(11), 1937.
Yu, G., & Bien, J. (2019). Estimating the error variance in a high-dimensional linear model. Biometrika, 106(3), 533–546.
Zhang, C. H., & Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1), 217–242.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (No. NRF-2021R1C1C2006380) for Semin Choi. This work was also supported by the National Research Foundation of Korea (NRF) Grants funded by the Korea government (MSIT) (NRF-2021R1C1C1004562 and RS-2023-00218231) and an Institute of Information & Communications Technology Planning & Evaluation (IITP) Grant funded by the Korea government (MSIT) [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)] for Gunwoong Park.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Appendices
Appendix 1: Proof for Theorem 3
Proof
We begin by proving the first theoretical result for general \(\Sigma\). For ease of notation, we write \(\lambda = \lambda _{n}\). Recall that the bias of the estimator (6) is expressed as follows:
Simple algebra yields that the denominator of the first term in Eq. (8) is as follows:
Meanwhile, for any fixed \(\lambda _0 \in {\mathbb {R}}^+\), Eqs. (2) and (3) in Dobriban et al. (2018) imply that
For ease of notation, let \(f_n(\lambda ) := \frac{1}{\lambda }\left\{ \frac{n}{p} - 1 + \frac{\lambda }{p} {\text{ tr }}\left( \left( S_n + \lambda I_p\right) ^{-1}\right) \right\}\). Note that \(\lim _{n\rightarrow \infty } f_n(\lambda _0) = \frac{v(-\lambda _0)}{\tau }\) for each \(\lambda _0 \in {\mathbb {R}}^+\). For the existence of \(\lim _{n\rightarrow \infty } f_n(\lambda _n)\), it suffices to show that the sequence \(f_n\), \(n=1,2,\ldots\), converges uniformly on \({\mathbb {R}}^+\), by the Moore–Osgood theorem.
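As a side illustration (not part of the proof), the pointwise convergence of \(f_n(\lambda _0)\) can be observed numerically. The sketch below assumes standard Gaussian entries with \(\Sigma = I_p\), so \(S_n = Z^TZ/n\); the choices \(\tau = 2\) and \(\lambda _0 = 1\) are illustrative. The values of \(f_n(\lambda _0)\) stabilize as \(n\) grows with \(p/n = \tau\) held fixed.

```python
import numpy as np

def f_n(Z: np.ndarray, lam: float) -> float:
    """Evaluate f_n(lam) = (1/lam) * { n/p - 1 + (lam/p) tr((S_n + lam I_p)^{-1}) }."""
    n, p = Z.shape
    S = Z.T @ Z / n
    tr = np.trace(np.linalg.inv(S + lam * np.eye(p)))
    return (n / p - 1 + (lam / p) * tr) / lam

rng = np.random.default_rng(0)
tau, lam0 = 2.0, 1.0   # illustrative ultra-high dimensional ratio tau = p/n > 1
vals = [f_n(rng.standard_normal((n, int(tau * n))), lam0) for n in (200, 400, 800)]
print(vals)            # nearly identical values as n grows
```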
Since the nonzero eigenvalues of \(S_n\) and \(\frac{1}{n}XX^T\) coincide, while \(S_n\) has 0 as an additional eigenvalue with multiplicity \((p-n)\), we have
Simple algebra yields that
Hence, we have
for any fixed \(\lambda _0 \in {\mathbb {R}}^+\) and all sufficiently large n, where \(\Lambda _{\min }^+(\cdot )\) denotes the smallest nonzero eigenvalue. Here, the last inequality follows from the fact that \(\lim _{n\rightarrow \infty }\Lambda _{\min }^+ (Z^TZ/n) = (1-\sqrt{\tau })^2\) by Theorem 2 in Bai and Yin (1993).
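Both spectral facts used above admit a quick numerical check. The sketch below (purely illustrative, assuming standard Gaussian \(Z\) with \(\Sigma = I_p\), so \(S_n = Z^TZ/n\)) verifies that the \(p-n\) smallest eigenvalues of \(S_n\) vanish, that its nonzero eigenvalues match those of \(\frac{1}{n}ZZ^T\), and that the smallest nonzero eigenvalue is close to the Bai–Yin limit \((1-\sqrt{\tau })^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 2.0                       # illustrative ratio tau = p/n > 1
n = 500
p = int(tau * n)
Z = rng.standard_normal((n, p))

S_n = Z.T @ Z / n               # p x p, rank at most n
G_n = Z @ Z.T / n               # n x n Gram matrix

eig_S = np.sort(np.linalg.eigvalsh(S_n))
eig_G = np.sort(np.linalg.eigvalsh(G_n))

# S_n has eigenvalue 0 with multiplicity p - n (up to numerical error) ...
assert np.allclose(eig_S[:p - n], 0.0, atol=1e-6)
# ... and its n nonzero eigenvalues coincide with those of (1/n) Z Z^T.
assert np.allclose(eig_S[p - n:], eig_G, atol=1e-6)

# Bai-Yin: the smallest nonzero eigenvalue approaches (1 - sqrt(tau))^2.
print(eig_G[0], (1 - np.sqrt(tau)) ** 2)
```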
Similarly, for the derivative \(f_n'(\lambda )=\partial f_n(\lambda )/\partial \lambda\), we observe that
for any fixed \(\lambda _0 \in {\mathbb {R}}^+\) and sufficiently large n.
Hence, \(f_n\) and its derivative are uniformly bounded, so the sequence \(f_n\), \(n=1,2,\ldots\), is uniformly convergent by the Arzelà–Ascoli theorem. Consequently, \(\lim _{n\rightarrow \infty } f_n(\lambda _n)\) exists. In addition, according to Lemma 2.3 in Dobriban et al. (2018), \(\lim _{n\rightarrow \infty } v(-\lambda _n) = v(-c)\) when \(\lim \lambda _n = c\), and hence we have
for all \(c \in [0,\infty )\).
Multiplying both sides by \(\tau\), the limit of the first term of the bias is as follows:
We now focus on the second part of the bias, \(\frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2} {{\varvec{\beta }}}^T S_n (S_n+\lambda _n I_p)^{-1} {{\varvec{\beta }}}\). As shown above, \(S_n\) has 0 as an eigenvalue with multiplicity \((p-n)\), and \(\Lambda _{\min }^+(S_n) \ge (1-\sqrt{\tau })^2\rho _{\min }/2\) for all sufficiently large n. Let \(\{e_1,\ldots ,e_{n}\}\) denote the eigenvectors of \(S_n\) corresponding to the nonzero eigenvalues and \(\{e_{n+1},\ldots ,e_{p}\}\) the eigenvectors corresponding to the eigenvalue 0. Then \({{\varvec{\beta }}}\) decomposes as \({{\varvec{\beta }}}= {{\varvec{\beta }}}_{E_1} + {{\varvec{\beta }}}_{E_2}\), where \(E_1\) and \(E_2\) are the subspaces of \({\mathbb {R}}^p\) spanned by \(\{e_1,\ldots ,e_{n}\}\) and \(\{e_{n+1},\ldots ,e_{p}\}\), respectively. Hence, we have
for all sufficiently large n.
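The eigenspace decomposition above can be illustrated numerically. The sketch below (an illustration only, with Gaussian entries and arbitrary illustrative values of \(n\), \(p\), \(\lambda\), and \({{\varvec{\beta }}}\)) splits \({{\varvec{\beta }}}\) into its components in \(E_1\) (the span of eigenvectors with nonzero eigenvalues) and \(E_2\) (the null space of \(S_n\)), and confirms that only \({{\varvec{\beta }}}_{E_1}\) contributes to the quadratic form \({{\varvec{\beta }}}^T S_n (S_n+\lambda I_p)^{-1} {{\varvec{\beta }}}\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 40, 100, 0.5          # illustrative sizes with p > n
X = rng.standard_normal((n, p))
S_n = X.T @ X / n
beta = rng.standard_normal(p)

vals, vecs = np.linalg.eigh(S_n)  # eigenvalues in ascending order
nonzero = vals > 1e-10            # the n eigenvectors spanning E_1
beta_E1 = vecs[:, nonzero] @ (vecs[:, nonzero].T @ beta)
beta_E2 = beta - beta_E1          # component in the null space E_2

# S_n annihilates beta_E2, so only beta_E1 contributes to the quadratic form.
R = np.linalg.inv(S_n + lam * np.eye(p))
q_full = beta @ S_n @ R @ beta
q_E1 = beta_E1 @ S_n @ R @ beta_E1
assert np.isclose(q_full, q_E1)
assert np.allclose(S_n @ beta_E2, 0.0, atol=1e-8)
```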
Applying Eqs. (8) and (10), we obtain
Since \(E_1\) is determined by the design matrix \(X = Z\Sigma ^{1/2}\), where Z is randomly generated and \(\Sigma\) has full rank, it holds that \(\liminf _{n\rightarrow \infty } \frac{\Vert {{\varvec{\beta }}}_{E_1}\Vert _2^2}{\Vert {{\varvec{\beta }}}\Vert _2^2} > 0\) almost surely. This completes the proof of the theoretical result for general \(\Sigma\). \(\square\)
Appendix 2: Proof for Theorem 4
Proof
We now prove the second theoretical result with \(\Sigma = I_p\). Eq. (10) also holds for \(\Sigma =I_p\), so it suffices to focus on the term \(\frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2} {{\varvec{\beta }}}^T S_n (S_n+\lambda _n I_p)^{-1} {{\varvec{\beta }}}= 1 - \frac{\lambda _n}{\Vert {{\varvec{\beta }}}\Vert _2^2} {{\varvec{\beta }}}^T (S_n+\lambda _n I_p)^{-1} {{\varvec{\beta }}}\). First, we apply Theorem 1 in Rubio and Mestre (2011) with \(\Theta = I_p/p\), which implies
for any fixed \(\lambda _0 \in {\mathbb {R}}^+\), where \(x_n(\lambda _0)>0\) is a deterministic sequence satisfying a certain fixed-point equation for each n.
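The resolvent identity invoked above, \({{\varvec{\beta }}}^T S_n (S_n+\lambda I_p)^{-1} {{\varvec{\beta }}}= \Vert {{\varvec{\beta }}}\Vert _2^2 - \lambda {{\varvec{\beta }}}^T (S_n+\lambda I_p)^{-1} {{\varvec{\beta }}}\), follows from \(S_n(S_n+\lambda I_p)^{-1} = I_p - \lambda (S_n+\lambda I_p)^{-1}\) and can be checked numerically. The sketch below uses Gaussian entries and illustrative values of \(n\), \(p\), and \(\lambda\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 40, 100, 0.5                   # illustrative sizes with p > n
X = rng.standard_normal((n, p))
S_n = X.T @ X / n
beta = rng.standard_normal(p)

R = np.linalg.inv(S_n + lam * np.eye(p))   # resolvent (S_n + lam I_p)^{-1}
lhs = beta @ S_n @ R @ beta / (beta @ beta)
rhs = 1 - lam * (beta @ R @ beta) / (beta @ beta)
assert np.isclose(lhs, rhs)                # the two normalized forms agree
```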
By Eq. (9), it holds that
for any fixed \(\lambda _0 \in {\mathbb {R}}^+\), and hence we can define \(x(\lambda _0): = \lim _{n\rightarrow \infty } x_n(\lambda _0)\).
Furthermore, by Theorem 1 in Rubio and Mestre (2011) with \(\Theta = {{\varvec{\beta }}}{{\varvec{\beta }}}^T/\Vert {{\varvec{\beta }}}\Vert _2^2\),
for any fixed \(\lambda _0 \in {\mathbb {R}}^+\), which implies that
for any fixed \(\lambda _0 \in {\mathbb {R}}^+\). Following the same argument as for Eq. (10), the existence of the limit of \(g_n(\lambda _n) := \frac{\lambda _n}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^T(S_n+\lambda _n I_p)^{-1}{{\varvec{\beta }}}=1-\frac{1}{\Vert {{\varvec{\beta }}}\Vert _2^2}{{\varvec{\beta }}}^TS_n(S_n+\lambda _n I_p)^{-1}{{\varvec{\beta }}}\) is confirmed by the following inequalities:
for any fixed \(\lambda _0 \in {\mathbb {R}}^+\) and all sufficiently large n.
Then, it holds that
Hence, we conclude that
as \(n\rightarrow \infty\). \(\square\)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Choi, S., Park, G. Asymptotic bias of the \(\ell _2\)-regularized error variance estimator. J. Korean Stat. Soc. 53, 132–148 (2024). https://doi.org/10.1007/s42952-023-00239-y