Abstract
In the master–worker distributed framework, this paper proposes a regularized gradient-enhanced loss (GEL) function for high-dimensional, large-scale linear regression with SCAD and adaptive LASSO penalties. The contributions of this paper are twofold. (1) Computationally, to exploit the computing power of each machine and accelerate convergence, the proposed distributed estimation method lets all workers optimize their corresponding GEL functions in parallel, after which the master aggregates the results. (2) In terms of communication, the proposed modified proximal alternating direction method of multipliers (ADMM) algorithm matches the centralized method based on the full sample within a few rounds of communication. Under mild assumptions, we establish the oracle properties of the SCAD- and adaptive-LASSO-penalized linear regression. Simulation studies assess the finite-sample properties of the proposed method, and an application to an HIV drug susceptibility study demonstrates its utility in practice.
References
Attouch H (2020) Fast inertial proximal ADMM algorithms for convex structured optimization with linear constraint. https://hal.archives-ouvertes.fr/hal-02501604
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122
Chen X, Liu W, Mao X, Yang Z (2020) Distributed high-dimensional regression under a quantile loss function. J Mach Learn Res 21(182):1–43
Cheng G, Shang Z (2015) Computational limits of divide-and-conquer method. arXiv preprint arXiv:1512.09226
Coffin JM (1995) HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science 267(5197):483–489
Duchi JC, Agarwal A, Wainwright MJ (2012) Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans Autom Control 57(3):592–606
Eckstein J (1994) Some saddle-function splitting methods for convex programming. Optimization Methods and Software 4(1):75–83
Fan J, Guo Y, Wang K (2021) Communication-efficient accurate statistical estimation. J Am Stat Assoc 1–11
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Fan Y, Lin N, Yin X (2021) Penalized quantile regression for distributed big data using the slack variable representation. J Comput Graph Stat 30(3):557–565
Fazel M, Pong TK, Sun D, Tseng P (2013) Hankel matrix rank minimization with applications to system identification and realization. SIAM J Matrix Anal Appl 34(3):946–977
Gao Y, Liu W, Wang H, Wang X, Yan Y, Zhang R (2021) A review of distributed statistical inference. Statistical Theory and Related Fields 1–11
Gu Y, Fan J, Kong L, Ma S, Zou H (2018) ADMM for high-dimensional sparse penalized quantile regression. Technometrics 60(3):319–331
He B, Liao LZ, Han D, Yang H (2002) A new inexact alternating directions method for monotone variational inequalities. Math Program 92(1):103–118
Hjort NL, Pollard D (2011) Asymptotics for minimisers of convex processes. arXiv preprint arXiv:1107.3806
Hu A, Li C, Wu J (2021) Communication-efficient modeling with penalized quantile regression for distributed data. Complexity 2021
Huang C, Huo X (2019) A distributed one-step estimator. Math Program 174(1):41–76
Jordan MI, Lee JD, Yang Y (2019) Communication-efficient distributed statistical inference. J Am Stat Assoc 114(526):668–681
Kannan R, Vempala S, Woodruff D (2014) Principal component analysis and higher correlations for distributed data. In Conference on Learning Theory 35:1040–1057
Lee JD, Sun Y, Liu Q, Taylor JE (2015) Communication-efficient sparse regression: a one-shot approach. arXiv preprint arXiv:1503.04337
Lian H, Liu J, Fan Z (2021) Distributed learning for sketched kernel regression. Neural Netw 143:368–376
Lin SB, Guo X, Zhou DX (2017) Distributed learning with regularized least squares. The Journal of Machine Learning Research 18(1):3202–3232
Lu J, Cheng G, Liu H (2016) Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212
Mcdonald R, Mohri M, Silberman N, Walker D, Mann G (2009) Efficient large-scale distributed training of conditional maximum entropy models. Adv Neural Inf Process Syst 22:1231–1239
Pan Y, Liu Z, Cai W (2020) Large-scale expectile regression with covariates missing at random. IEEE Access 8:36502–36513
Pan Y (2021) Distributed optimization and statistical learning for large-scale penalized expectile regression. Journal of the Korean Statistical Society 50(1):290–314
Pollard D (1991) Asymptotics for least absolute deviation regression estimators. Economet Theor 7(2):186–199
Rhee SY, Gonzales MJ, Kantor R, Betts BJ, Ravela J, Shafer RW (2003) Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res 31(1):298–303
Rosenblatt JD, Nadler B (2016) On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA 5(4):379–404
Shamir O, Srebro N, Zhang T (2014) Communication-efficient distributed optimization using an approximate newton-type method. In International Conference on Machine Learning 32(2):1000–1008
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288
Wang HJ, McKeague IW, Qian M (2018) Testing for marginal linear effects in quantile regression. J R Stat Soc Ser B Stat Methodol 80(2):433–452
Wang J, Kolar M, Srebro N, Zhang T (2017) Efficient distributed learning with sparsity. In International Conference on Machine Learning 70:3636–3645
Wang L, Lian H (2020) Communication-efficient estimation of high-dimensional quantile regression. Anal Appl 18(06):1057–1075
Xu G, Shang Z, Cheng G (2018) Optimal tuning for divide-and-conquer kernel ridge regression with massive data. In International Conference on Machine Learning 80:5483–5491
Zhang Y, Duchi JC, Wainwright MJ (2013) Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research 14(1):3321–3363
Zhang Y, Lin X (2015) DiSCO: Distributed optimization for self-concordant empirical loss. In International conference on machine learning 37:362–370
Zhang CH, Zhang T (2012) A general theory of concave regularization for high-dimensional sparse estimation problems. Stat Sci 27(4):576–593
Zhao W, Zhang F, Lian H (2019) Debiasing and distributed estimation for high-dimensional quantile regression. IEEE Transactions on Neural Networks and Learning Systems 31(7):2569–2577
Zhu X, Li F, Wang H (2021) Least-square approximation for a distributed system. J Comput Graph Stat 30:1004–1018
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Zou H, Li R (2008) One-step sparse estimates in nonconcave penalized likelihood models. Ann Stat 36(4):1509–1533
Acknowledgements
This work is supported by the National Natural Science Foundation of China (NSFC) (No. 11901175), the Science and Technology Research Project of Hubei Education Department (No. Q20211007).
Appendix: Proof of Theorem
This section collects all the lemmas and proofs needed for the main results.
Lemma 1
Let \(\left\{ S_{n}(u):u\in U\right\} \) be a sequence of random convex functions defined on convex, open subset U of \({\mathbb {R}}^{p}\). Suppose \(S(\cdot )\) is a real-valued function on U for \(S_{n}(u)\longrightarrow S(u)\) in probability for each \(u\in U\). Then for each compact subset K of U, \(\sup \limits _{u\in K}|S_{n}(u)-S(u)|\longrightarrow 0\) in probability and the function \(S(\cdot )\) is necessarily convex on U.
The proof of Lemma 1 can be found in Pollard (1991).
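As a quick numerical illustration of Lemma 1 (our own sketch, not part of the paper: the distribution of the \(e_{i}\), the grid, and the function names are all assumptions), consider a sample average of random convex parabolas, whose pointwise limit is the convex function \(S(u)=u^{2}+1\); the supremum error over a compact set shrinks as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def S_n(u, n):
    # Random convex function: average of (u - e_i)^2 with e_i ~ N(0, 1),
    # which converges pointwise in probability to S(u) = u^2 + 1.
    e = rng.standard_normal(n)
    return np.mean((u - e) ** 2)

def sup_err(n, K=2.0, grid_size=41):
    # Approximate sup over the compact set [-K, K] of |S_n(u) - S(u)|,
    # evaluated on a finite grid.
    grid = np.linspace(-K, K, grid_size)
    return max(abs(S_n(u, n) - (u ** 2 + 1.0)) for u in grid)
```

For large n the supremum error over the grid is small, in line with the uniform convergence that Lemma 1 guarantees.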
Lemma 2
Let V be a symmetric and positive definite matrix, W be a random variable and \(A_{n}(u)\) be a convex objective function. If
then \(\alpha _{n}\), the \(\arg \min \) of \(A_{n}(u)\), satisfies: \(\alpha _{n} \xrightarrow {\ d\ }-V^{-1}W\).
The proof of Lemma 2 can be found in Hjort and Pollard (2011).
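To see where the limit \(-V^{-1}W\) comes from, consider the exact quadratic case (a sketch only; Lemma 2 extends this conclusion to convex objectives that equal this quadratic up to a remainder vanishing in probability):

```latex
A_{n}(u) = \tfrac{1}{2}\,u^{T}Vu + W^{T}u
\quad\Longrightarrow\quad
\nabla A_{n}(u) = Vu + W = 0
\quad\Longrightarrow\quad
\alpha_{n} = \arg\min_{u} A_{n}(u) = -V^{-1}W,
```

where the minimizer is well defined because \(V\) is symmetric and positive definite.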
Proof of Theorem 1
Note that \({\widetilde{F}}(\beta )\), defined in (2.7), is not convex because of the non-convexity of the SCAD penalty. We therefore consider a local minimizer instead of the global minimizer \({\widehat{\beta }}^{SCAD}\).
According to Lemma 1 and Fan and Li (2001), for any given \(\delta >0\), there is a large enough constant M such that
which implies that, with probability at least \(1-\delta \), there exists a local minimum in the ball \(\left\{ \beta _{0}+u/\sqrt{n}:\Vert u\Vert _{2}\le M\right\} \). This in turn implies that there exists a local minimizer such that \(\Vert {\widehat{\beta }}^{SCAD}-\beta _{0}\Vert _{2}=O_{p}(n^{-1/2})\). Performing some simple calculations, we have
where \(\epsilon _{ji}=y_{ji}-x_{ji}^{T}\beta _{0}\) and \(q<p\).
For any fixed u, we have
Since \(\rho (u)=u^{2}\), we have \(\rho '(u)=2u\) and hence \(\rho '(\epsilon _{ji})=2\epsilon _{ji}\); by \(E(\epsilon _{ji})=0\) in Condition (A1), it follows that \(E[\rho '(\epsilon _{ji})]=0\). Denote \(Q_{ji}(t)=\text {E}[\rho (\epsilon _{ji}-t)-\rho (\epsilon _{ji})]\); by the second-order Taylor expansion of \(Q_{ji}(t)\) at 0, we have
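Concretely, for the squared loss \(Q_{ji}(t)\) can be computed in closed form (a quick check we add for clarity, using only \(E(\epsilon _{ji})=0\) from Condition (A1)):

```latex
Q_{ji}(t)
= \operatorname{E}\!\left[(\epsilon_{ji}-t)^{2}-\epsilon_{ji}^{2}\right]
= t^{2} - 2t\,\operatorname{E}(\epsilon_{ji})
= t^{2},
\qquad\text{hence}\quad Q_{ji}'(0)=0, \quad Q_{ji}''(0)=2 .
```

This matches the value \(\rho ''=2\) used in the remainder bound below.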
According to Condition (A2), we have \(\sum _{i=1}^{n}x_{i}x_{i}^{T}/{n}\longrightarrow I\). Further, we have
where u is fixed.
Denote \(R_{ji,n}=\rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji})+2\epsilon _{ji} x_{ji}^{T}u/\sqrt{n}\); then by Taylor expansion, there exists \(g_{ji}\) between \(\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n}\) and \(\epsilon _{ji}\) such that
since \(\rho ''(g_{ji})=2\). Then we have
Under Condition (A2), we have \(\sum _{i=1}^{n} u^{T}\left[ (x_{ji}x_{ji}^{T})/{n} \right] u\longrightarrow u^{T}Iu\). Then, by the fact that \(\max \limits _{1\le i\le n}(\Vert x_{ji}\Vert /{\sqrt{n}})\longrightarrow 0\) and \(\Vert u\Vert _{2}=M\), we have
Based on the above results, we have
Combining \(\text {E}\left[ \sum _{i=1}^{n}(R_{ji,n}-\text {E}R_{ji,n}) \right] ^{2}\le \sum _{i=1}^{n}\text {E}R_{ji,n}^{2}\) with (A.11), we can obtain that \(\sum _{i=1}^{n}(R_{ji,n}-\text {E}R_{ji,n})\longrightarrow 0\). Thus, we can get
For \(S_{2}\), we have
where \({\overline{\epsilon }}_{ji}=y_{ji}-x_{ji}^{T}{\overline{\beta }}\). Further, we have
Now we can simplify \(S_{n}(u)\) to
where \(\xi _{i}=2{\overline{\epsilon }}_{ji}x_{ji}, \eta _{i}=2\sum _{j=1}^{m}{\overline{\epsilon }}_{ji}x_{ji}, \zeta _{i}=2\epsilon _{ji}x_{ji}, D_{i}=\xi _{i}-{\eta _{i}}/{m}-\zeta _{i},i=1,2,\ldots ,n, W_{n}=({1}/{\sqrt{n}})\sum _{i=1}^{n}D_{i}\).
We now deal with \(W_{n}\). According to Condition (A1), we can get
where \(a=4\text {Var}({\overline{\epsilon }}_{ji}), b=4\text {Cov}({\overline{\epsilon }}_{ji},\epsilon _{ji})\). Since the \(D_{i}\) are independent and identically distributed zero-mean random vectors, we have
Under Condition (A2), we have
By routine calculations, we can get
as \(n\longrightarrow \infty \), where \(c=(m-1)a+(2-2m)b+4m\sigma ^{2}\). By the central limit theorem, we can obtain that
It follows that \(W_{n}^{T}u\) is bounded in probability, i.e.,
By applying Lemma 1 to \(Z_{n}(u)=S_{n}(u)-W_{n}^{T}u\), we can strengthen this pointwise convergence to uniform convergence on compact subsets of \({\mathbb {R}}^{p}\). We now analyze \(L_{n}(u)\). For the SCAD penalty, \(p'_{\lambda }(u)=\lambda I(|u|\le \lambda )+(\max (0,a\lambda -|u|))/(a-1)I(|u|> \lambda )\). If \(u\ge a\lambda \) (with \(a=3.7\)), then \(p'_{\lambda }(u)=0\); if \(|u|<\lambda \) and \(\lambda \longrightarrow 0\), then \(p'_{\lambda }(u)=\lambda \longrightarrow 0\).
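The two regimes of \(p'_{\lambda }\) invoked here are easy to check numerically. Below is a minimal sketch (the function name is ours; the formula is the SCAD derivative quoted above, with the conventional choice \(a=3.7\)):

```python
import numpy as np

def scad_deriv(u, lam, a=3.7):
    # SCAD derivative p'_lambda(u): equal to lam for |u| <= lam, and
    # max(0, a*lam - |u|) / (a - 1) for |u| > lam (zero once |u| >= a*lam).
    u = np.abs(np.asarray(u, dtype=float))
    return np.where(u <= lam, lam, np.maximum(0.0, a * lam - u) / (a - 1.0))
```

In particular, the derivative equals \(\lambda \) on \([-\lambda ,\lambda ]\) and vanishes for \(|u|\ge a\lambda \), which is exactly the dichotomy that drives the behavior of \(L_{n}(u)\).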
According to the results above, we can obtain that:
By Condition (A2) and Eqs. (A.2), (A.15), (A.20) and (A.21), \(n\left[ {\widetilde{F}}_{SCAD}(\beta _{0}+u/\sqrt{n})-{\widetilde{F}}_{SCAD}(\beta _{0}) \right] \) is dominated by the term \(u^{T}\left[ \sum _{i=1}^{n}x_{ji}x_{ji}^{T}/n \right] u\) when \(\Vert u\Vert _{2}=M\) is large enough, which yields Eq. (A.1). This in turn implies that there exists a local minimizer such that \(\Vert {\widehat{\beta }}^{SCAD}-\beta _{0}\Vert _{2}=O_{p}(n^{-1/2})\) as \(n\longrightarrow \infty \). Since \(N=mn\), if \(\lambda =\lambda (N)\longrightarrow 0\), then \({\widehat{\beta }}^{SCAD}\) converges to \(\beta _{0}\) in probability with \(\Vert {\widehat{\beta }}^{SCAD}-\beta _{0}\Vert _{2}=O_{p}(N^{-1/2})\) as \(N\longrightarrow \infty \). This completes the proof. \(\square \)
Lemma 3
Suppose that the sample set \(\left\{ x_{i},y_{i}\right\} _{i=1}^{N}\) is generated according to process (3.1). Under Conditions (A1) and (A2), if \(\lambda =\lambda (n)\longrightarrow 0\) and \(\sqrt{n}\lambda \longrightarrow \infty \) as \(n\longrightarrow \infty \), then with probability tending to one, for any given \(\beta _{1}\) satisfying \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\) and any constant M, we obtain
i.e., for any \(\delta >0\), we have
Proof of Lemma 3
For any \(\beta _{1}\) with \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\) and \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\), based on the proof of Theorem 1, we can obtain that
According to \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\), \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\) and (A.20), we can get that
and
where \(I_{22}\) is the bottom-right \((p-q)\)-by-\((p-q)\) submatrix of I.
Under the conditions \(\lambda =\lambda (n)\longrightarrow 0\) and \(\sqrt{n}\lambda \longrightarrow \infty \) as \(n\longrightarrow \infty \) and the fact that \(\varliminf \limits _{\lambda \longrightarrow 0}\varliminf \limits _{\theta \longrightarrow 0^{+}} p'_{\lambda }(\theta )/\lambda =1\), we have
Since \(\sqrt{n}\lambda \longrightarrow \infty \), we can get \(n\lambda \longrightarrow \infty \), which implies that, \(-n\sum _{l=q+1}^{p}p'_{\lambda }(|{\overline{\beta }}_{l}|)|\beta _{l}|\) of (A.22) dominates in magnitude. That is, \({\widetilde{F}}_{SCAD}((\beta _{1}^{T},0^{T})^{T}) -{\widetilde{F}}_{SCAD}((\beta _{1}^{T},\beta _{2}^{T})^{T})<0\) for large n. Then
with probability tending to one on any compact subset of \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\) and \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\). This completes the proof. \(\square \)
Proof of Theorem 2
(I) As discussed in Fan and Li (2001), and by Lemma 3 with \(N=mn\), it is obvious that when \(\lambda =\lambda (N)\longrightarrow 0\) and \(\sqrt{N}\lambda \longrightarrow \infty \) as \(N\longrightarrow \infty \), we obtain \({\widehat{\beta }}_{2}^{SCAD}=0\). (II) According to Theorem 1, there exists a \(\sqrt{N}\)-consistent minimizer \({\widehat{\beta }}_{1}^{SCAD}\) of \({\widetilde{F}}_{SCAD}\left( (\beta _{1}^{T},0^{T})^{T} \right) \), viewed as a function of \(\beta _{1}\).
The proof of Theorem 1 implies that \(\sqrt{n}({\widehat{\beta }}_{1}^{SCAD}-\beta _{10})\) minimizes
with respect to \(\theta \), where \(\theta =(\theta _{1},\cdots ,\theta _{q})^{T}\in {\mathbb {R}}^{q}\). By Lemma 1 and (A.15), we can obtain that
uniformly in any compact subset of \({\mathbb {R}}^{q}\), where \(x_{ji}^{1}\) consists of the first q components of \(x_{ji}\), \(\xi _{i}^{1}=2{\overline{\epsilon }}_{ji}x_{ji}^{1}, \eta _{i}^{1}=2\sum _{j=1}^{m}{\overline{\epsilon }}_{ji}x_{ji}^{1}, \zeta _{i}^{1}=2\epsilon _{ji}x_{ji}^{1}, D_{i}^{1}=\xi _{i}^{1}-\eta _{i}^{1}/m-\zeta _{i}^{1},i=1,2,\ldots ,n\), and \(W_{n}^{1}=\sum _{i=1}^{n}D_{i}^{1}/\sqrt{n}\). Based on the above results and (A.19), we can get
where \(I_{11}\) is the top-left q-by-q submatrix of I.
We define
Note that
For the SCAD penalty, \(p'_{\lambda }(\theta )=\lambda I(|\theta |\le \lambda )+(\max (0,a\lambda -|\theta |))/(a-1)I(|\theta |> \lambda )\). If \(\theta \ge a\lambda \) (with \(a=3.7\)), then \(p'_{\lambda }(\theta )=0\); if \(|\theta |<\lambda \) and \(\lambda \longrightarrow 0\), then \(p'_{\lambda }(\theta )=\lambda \longrightarrow 0\). For \(\beta _{0k}\ne 0\), since \(|{\overline{\beta }}_{0k}|\xrightarrow {\ \text {P}\ }|\beta _{0k}|>0\), the condition \(\lambda \longrightarrow 0\) ensures \(T_{1k}\longrightarrow \sqrt{n}p'_{\lambda }(|{\overline{\beta }}_{0k}|) \text {Sign}(\beta _{0k})\theta _{k}\xrightarrow {\ \text {P}\ }0\). For \(\beta _{0k}=0\), we have \(T_{1k}=0\) if \(\theta _{k}=0\); when \(\theta _{k}\ne 0\), since \(|{\overline{\beta }}_{0k}|=O_{p}(n^{-1/2})\) and \(p'_{\lambda }(\theta )=\lambda \) for \(|\theta |<\lambda \), if \(\sqrt{n}\lambda \longrightarrow \infty \) then \(T_{1k}=\sqrt{n}p'_{\lambda }(|{\overline{\beta }}_{0k}|)|\theta _{k}|=|\theta _{k}|\sqrt{n}\lambda \) with probability tending to one, and thus \(T_{1k}\xrightarrow {\ \text {P}\ }\infty \). Let us write \(\theta ^{*}=(\theta _{10}^{T},\theta _{20}^{T})^{T}\); then we have
In Theorem 2, \(\theta ^{*}=(\theta ^{T},0^{T})^{T}\), so \(T_{1}\xrightarrow {\ \text {P}\ }0\). We can further obtain that
Based on (A.24) and (A.26), together with \(\lim \limits _{n\longrightarrow \infty }\sum _{i=1}^{n}x_{ji}^{1}(x_{ji}^{1})^{T}/n=I_{11}\), we have
According to (A.25) and Lemma 2, we can obtain that
Due to \(N=mn\), we can get that
This completes the proof. \(\square \)
Proof of Theorem 3
(I) For any \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\) and \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\), according to (A.22), we can get that
Note that the first two terms of (A.30) are exactly the same as in (A.22), and hence can be bounded similarly. However, the third term, since \(n^{(1+r)/2}\lambda \longrightarrow \infty \) and \(\sqrt{n}{\widehat{\beta }}_{k}=O_{p}(1)\), satisfies
These facts in turn imply that, as \(n\longrightarrow \infty \),
Then we have
Since \(N=mn\), if \(\lambda =\lambda (N)\) satisfies \(\sqrt{N}\lambda \longrightarrow 0\) and \(N^{(1+r)/2}\lambda \longrightarrow \infty \) as \(N\longrightarrow \infty \), we obtain \({\widehat{\beta }}_{2}^{AL}=0\).
(II) According to (A.2),
We now analyze the third term. For \(k=1,2,\ldots ,q\), since \(\beta _{0k}\ne 0\), we have \({\overline{w}}_{k}\xrightarrow {\ \text {P}\ }|\beta _{0k}|^{-r}\); by routine calculation, we get
and by \(\sqrt{n}\lambda \longrightarrow 0\), we can obtain that
For \(k=q+1,q+2,\cdots ,p\) and the true coefficient \(\beta _{0k}=0\), we have
When \(u_{k}\ne 0\), since \(\sqrt{n}\lambda {\overline{w}}_{k}=n^{(1+r)/2}\lambda (\sqrt{n}|{\widehat{\beta }}_{k}|)^{-r}\), \(n^{(1+r)/2}\lambda \longrightarrow \infty \) and \(\sqrt{n}|{\widehat{\beta }}_{k}|=O_{p}(1)\), we can get that
When \(u_{k}=0\), \(n\lambda \left[ {\overline{w}}_{k}|\beta _{0k}+u_{k}/\sqrt{n}| -{\overline{w}}_{k}|\beta _{0k}| \right] =0\).
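The dichotomy between zero and nonzero coefficients can be illustrated numerically. In the sketch below (our own illustration, not from the paper: the function name, the choices \(r=1\), \(\lambda =n^{-3/4}\), and \({\widehat{\beta }}_{k}=c/\sqrt{n}\) are all assumptions, picked so that \(\sqrt{n}\lambda \longrightarrow 0\) while \(n^{(1+r)/2}\lambda \longrightarrow \infty \)), the effective penalty \(\sqrt{n}\lambda {\overline{w}}_{k}\) on a null coefficient grows like \(n^{1/4}\):

```python
import numpy as np

def penalty_scale(n, r=1.0, c=1.0):
    # Effective adaptive-LASSO penalty sqrt(n)*lam*w_k on a null coefficient,
    # with weight w_k = |beta_hat_k|^{-r} and beta_hat_k = c/sqrt(n) = O_p(n^{-1/2}).
    # lam = n^{-3/4} gives sqrt(n)*lam -> 0 and n^{(1+r)/2}*lam -> inf for r = 1.
    lam = n ** (-0.75)
    beta_hat = c / np.sqrt(n)
    w = np.abs(beta_hat) ** (-r)
    return np.sqrt(n) * lam * w  # = n^{(1+r)/2} * lam * (sqrt(n)|beta_hat_k|)^{-r}
```

For \(c=1\) this equals \(n^{1/4}\), roughly 3.16 at \(n=100\) and 100 at \(n=10^{8}\), mirroring the divergence of the penalty term for null coefficients.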
We define \(u^{*}=(u_{10}^{T},u_{20}^{T})^{T}\) and \(W_{n}=((W_{n}^{1})^{T},(W_{n}^{2})^{T})^{T}\). Combining (A.31), (A.32), (A.33), Theorem 1 and Condition (A2), we conclude that for each fixed u,
where \(u^{1}=(u_{1},u_{2},\cdots ,u_{q})^{T}\). Note that \(n\left[ {\widetilde{F}}_{AL}(\beta _{0}+u/\sqrt{n})-{\widetilde{F}}_{AL}(\beta _{0}) \right] \) is convex in u and V(u) has a unique minimizer; then we have
According to (A.25), (A.34), (A.35) and Lemma 2, we can obtain that
Due to \(N=mn\), if \(\sqrt{N}\lambda \longrightarrow 0\) and \(N^{(1+r)/2}\lambda \longrightarrow \infty \) as \(N\longrightarrow \infty \), we can get that
This completes the proof. \(\square \)
Liu, Z., Zhao, X. & Pan, Y. Communication-efficient distributed estimation for high-dimensional large-scale linear regression. Metrika 86, 455–485 (2023). https://doi.org/10.1007/s00184-022-00878-x