
Communication-efficient distributed estimation for high-dimensional large-scale linear regression

Abstract

Within the master-worker distributed framework, this paper develops a regularized gradient-enhanced loss (GEL) function for high-dimensional large-scale linear regression with the SCAD and adaptive LASSO penalties. The contribution of this paper is twofold: (1) computationally, to exploit the computing power of each machine and speed up convergence, the proposed distributed estimation method lets all workers optimize their corresponding GEL functions in parallel, with the results then aggregated by the master; (2) in terms of communication, the proposed modified proximal alternating direction method of multipliers (ADMM) algorithm matches the centralized method based on the full sample after only a few rounds of communication. Under mild assumptions, we establish the oracle properties of the SCAD- and adaptive-LASSO-penalized linear regression. The finite-sample properties of the proposed method are assessed through simulation studies. An application to an HIV drug susceptibility study demonstrates the utility of the proposed method in practice.
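The master-worker scheme described above (workers minimize a gradient-enhanced surrogate loss in parallel; the master aggregates) can be illustrated with a minimal, unpenalized squared-loss sketch. All function and variable names here are ours, and the penalty and proximal-ADMM steps of the actual method are omitted:

```python
import numpy as np

def local_grad(X, y, beta):
    """Gradient of the local squared loss F_j(b) = (1/n) * ||y - X b||^2."""
    return -2.0 / len(y) * X.T @ (y - X @ beta)

def gel_worker(X, y, shift):
    """Closed-form minimizer of (1/n)||y - X b||^2 + <shift, b>,
    where shift plays the role of F'_N(beta_bar) - F'_j(beta_bar)."""
    n = len(y)
    return np.linalg.solve(2.0 / n * X.T @ X, 2.0 / n * X.T @ y - shift)

def gel_round(Xs, ys, beta_bar):
    """One communication round: the master pools local gradients,
    each worker solves its gradient-enhanced loss, the master averages."""
    grads = [local_grad(X, y, beta_bar) for X, y in zip(Xs, ys)]
    grad_global = np.mean(grads, axis=0)            # aggregated at the master
    ests = [gel_worker(X, y, grad_global - g)
            for (X, y), g in zip(zip(Xs, ys), grads)]
    return np.mean(ests, axis=0)                    # averaged at the master

# toy demonstration on simulated data split across m = 4 workers
rng = np.random.default_rng(0)
m, n, p = 4, 500, 5
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Xs = [rng.normal(size=(n, p)) for _ in range(m)]
ys = [X @ beta_true + 0.1 * rng.normal(size=n) for X in Xs]
beta_bar = np.linalg.lstsq(Xs[0], ys[0], rcond=None)[0]  # pilot estimate
beta_hat = gel_round(Xs, ys, beta_bar)
```

With a root-n-consistent pilot, a single such round already recovers the truth closely on this toy problem, which is the communication-efficiency point made in the abstract.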


Fig. 1
Fig. 2

References

  • Attouch H (2020) Fast inertial proximal ADMM algorithms for convex structured optimization with linear constraint. https://hal.archives-ouvertes.fr/hal-02501604

  • Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122


  • Chen X, Liu W, Mao X, Yang Z (2020) Distributed high-dimensional regression under a quantile loss function. J Mach Learn Res 21(182):1–43


  • Cheng G, Shang Z (2015) Computational limits of divide-and-conquer method. arXiv preprint arXiv:1512.09226

  • Coffin JM (1995) HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science 267(5197):483–489


  • Duchi JC, Agarwal A, Wainwright MJ (2012) Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans Autom Control 57(3):592–606


  • Eckstein J (1994) Some saddle-function splitting methods for convex programming. Optimization Methods and Software 4(1):75–83


  • Fan J, Guo Y, Wang K (2021) Communication-efficient accurate statistical estimation. Journal of the American Statistical Association, 1-11

  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360


  • Fan Y, Lin N, Yin X (2021) Penalized quantile regression for distributed big data using the slack variable representation. J Comput Graph Stat 30(3):557–565


  • Fazel M, Pong TK, Sun D, Tseng P (2013) Hankel matrix rank minimization with applications to system identification and realization. SIAM J Matrix Anal Appl 34(3):946–977


  • Gao Y, Liu W, Wang H, Wang X, Yan Y, Zhang R (2021) A review of distributed statistical inference. Statistical Theory and Related Fields, 1-11

  • Gu Y, Fan J, Kong L, Ma S, Zou H (2018) ADMM for high-dimensional sparse penalized quantile regression. Technometrics 60(3):319–331


  • He B, Liao LZ, Han D, Yang H (2002) A new inexact alternating directions method for monotone variational inequalities. Math Program 92(1):103–118


  • Hjort NL, Pollard D (2011) Asymptotics for minimisers of convex processes. arXiv preprint arXiv:1107.3806

  • Hu A, Li C, Wu J (2021) Communication-efficient modeling with penalized quantile regression for distributed data. Complexity, 2021

  • Huang C, Huo X (2019) A distributed one-step estimator. Math Program 174(1):41–76


  • Jordan MI, Lee JD, Yang Y (2019) Communication-efficient distributed statistical inference. J Am Stat Assoc 114(526):668–681


  • Kannan R, Vempala S, Woodruff D (2014) Principal component analysis and higher correlations for distributed data. In Conference on Learning Theory 35:1040–1057


  • Lee JD, Sun Y, Liu Q, Taylor JE (2015) Communication-efficient sparse regression: a one-shot approach. arXiv preprint arXiv:1503.04337

  • Lian H, Liu J, Fan Z (2021) Distributed learning for sketched kernel regression. Neural Netw 143:368–376


  • Lin SB, Guo X, Zhou DX (2017) Distributed learning with regularized least squares. J Mach Learn Res 18(1):3202–3232


  • Lu J, Cheng G, Liu H (2016) Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212

  • Mcdonald R, Mohri M, Silberman N, Walker D, Mann G (2009) Efficient large-scale distributed training of conditional maximum entropy models. Adv Neural Inf Process Syst 22:1231–1239


  • Pan Y, Liu Z, Cai W (2020) Large-scale expectile regression with covariates missing at random. IEEE Access 8:36502–36513


  • Pan Y (2021) Distributed optimization and statistical learning for large-scale penalized expectile regression. Journal of the Korean Statistical Society 50(1):290–314


  • Pollard D (1991) Asymptotics for least absolute deviation regression estimators. Economet Theor 7(2):186–199


  • Rhee SY, Gonzales MJ, Kantor R, Betts BJ, Ravela J, Shafer RW (2003) Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res 31(1):298–303


  • Rosenblatt JD, Nadler B (2016) On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA 5(4):379–404


  • Shamir O, Srebro N, Zhang T (2014) Communication-efficient distributed optimization using an approximate newton-type method. In International Conference on Machine Learning 32(2):1000–1008


  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288


  • Wang HJ, McKeague IW, Qian M (2018) Testing for marginal linear effects in quantile regression. J R Stat Soc Ser B Stat Methodol 80(2):433–452


  • Wang J, Kolar M, Srebro N, Zhang T (2017) Efficient distributed learning with sparsity. In International Conference on Machine Learning 70:3636–3645


  • Wang L, Lian H (2020) Communication-efficient estimation of high-dimensional quantile regression. Anal Appl 18(06):1057–1075


  • Xu G, Shang Z, Cheng G (2018) Optimal tuning for divide-and-conquer kernel ridge regression with massive data. In International Conference on Machine Learning 80:5483–5491


  • Zhang Y, Duchi JC, Wainwright MJ (2013) Communication-efficient algorithms for statistical optimization. J Mach Learn Res 14(1):3321–3363


  • Zhang Y, Lin X (2015) DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning 37:362–370


  • Zhang CH, Zhang T (2012) A general theory of concave regularization for high-dimensional sparse estimation problems. Stat Sci 27(4):576–593


  • Zhao W, Zhang F, Lian H (2019) Debiasing and distributed estimation for high-dimensional quantile regression. IEEE Transactions on Neural Networks and Learning Systems 31(7):2569–2577


  • Zhu X, Li F, Wang H (2021) Least-square approximation for a distributed system. J Comput Graph Stat 30:1004–1018


  • Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429


  • Zou H, Li R (2008) One-step sparse estimates in nonconcave penalized likelihood models. Ann Stat 36(4):1509–1533



Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC) (No. 11901175), the Science and Technology Research Project of Hubei Education Department (No. Q20211007).

Author information


Corresponding author

Correspondence to Yingli Pan.


Appendix: Proof of Theorem


This section collects all lemmas and proofs needed for the main results.

Lemma 1

Let \(\left\{ S_{n}(u):u\in U\right\} \) be a sequence of random convex functions defined on a convex, open subset U of \({\mathbb {R}}^{p}\). Suppose \(S(\cdot )\) is a real-valued function on U such that \(S_{n}(u)\longrightarrow S(u)\) in probability for each \(u\in U\). Then for each compact subset K of U, \(\sup \limits _{u\in K}|S_{n}(u)-S(u)|\longrightarrow 0\) in probability, and the function \(S(\cdot )\) is necessarily convex on U.

The proof of Lemma 1 can be found in Pollard (1991).

Lemma 2

Let V be a symmetric and positive definite matrix, W be a random vector and \(A_{n}(u)\) be a convex objective function. If

$$\begin{aligned} A_{n}(u)=\frac{1}{2}u^{T}Vu+W^{T}u+o_{p}(1), \end{aligned}$$

then \(\alpha _{n}\), the \(\arg \min \) of \(A_{n}(u)\), satisfies: \(\alpha _{n} \xrightarrow {\ d\ }-V^{-1}W\).

The proof of Lemma 2 can be found in Hjort and Pollard (2011).
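The limiting objective in Lemma 2 can be checked numerically: for any positive definite V, the function \(\frac{1}{2}u^{T}Vu+W^{T}u\) is minimized exactly at \(-V^{-1}W\). The following sketch (our own illustration, with arbitrary V and W) verifies the first-order condition and compares against perturbed points:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
V = B @ B.T + 3.0 * np.eye(3)        # symmetric positive definite V
W = rng.normal(size=3)

def A(u):
    # limiting objective of Lemma 2, without the o_p(1) term
    return 0.5 * u @ V @ u + W @ u

u_star = -np.linalg.solve(V, W)      # candidate minimizer -V^{-1} W
# first-order condition: the gradient V u + W vanishes at u_star
assert np.allclose(V @ u_star + W, np.zeros(3))
# u_star is no worse than randomly perturbed points
assert all(A(u_star) <= A(u_star + rng.normal(size=3)) for _ in range(100))
```

In the lemma, \(A_{n}\) equals this quadratic only up to \(o_{p}(1)\), and convexity is what transfers the minimizer of the limit to the limit of the minimizers.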

Proof of Theorem 1

In this case, note that \({\widetilde{F}}(\beta )\), defined in (2.7), is not a convex function, owing to the non-convexity of the SCAD penalty. Based on this fact, we consider a local minimizer \({\widehat{\beta }}^{SCAD}\) instead of a global one.

According to Lemma 1 and Fan and Li (2001), for any given \(\delta >0\), there is a sufficiently large constant M such that

$$\begin{aligned} \text {P}\left[ \inf \limits _{\Vert u\Vert _{2}=M}{\widetilde{F}}_{SCAD}(\beta _{0}+u/\sqrt{n})>{\widetilde{F}}_{SCAD}(\beta _{0}) \right] \ge 1-\delta , \end{aligned}$$
(A.1)

which implies that, with probability at least \(1-\delta \), there exists a local minimum in the ball \(\left\{ \beta _{0}+u/\sqrt{n}:\Vert u\Vert _{2}\le M\right\} \). This in turn implies that there exists a local minimizer such that \(\Vert {\widehat{\beta }}^{SCAD}-\beta _{0}\Vert _{2}=O_{p}(n^{-1/2})\). Performing some simple calculations, we have

$$\begin{aligned}&n\left[ {\widetilde{F}}_{SCAD}(\beta _{0}+u/\sqrt{n})-{\widetilde{F}}_{SCAD}(\beta _{0}) \right] \nonumber \\&\quad =n\left[ F_{j}(\beta _{0}+u/\sqrt{n})-F_{j}(\beta _{0}) \right] +n\left[ \langle F'_{N}({\overline{\beta }})-F'_{j}({\overline{\beta }}),u/\sqrt{n}\rangle \right] \nonumber \\&\qquad +n\left[ \sum _{k=1}^{p}p'_{\lambda }(|{\overline{\beta }}_{0k}|)|\beta _{0k}+u_{k}/\sqrt{n}| \right] -n\left[ \sum _{k=1}^{p}p'_{\lambda }(|{\overline{\beta }}_{0k}|)|\beta _{0k}| \right] \nonumber \\&\quad =\sum _{i=1}^{n}\left[ \rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji}) \right] +n\left[ \langle F'_{N}({\overline{\beta }})-F'_{j}({\overline{\beta }}),u/\sqrt{n}\rangle \right] \nonumber \\&\qquad +n\left[ \sum _{k=1}^{p}p'_{\lambda }(|{\overline{\beta }}_{0k}|)\left( |\beta _{0k}+u_{k}/\sqrt{n}|-|\beta _{0k}| \right) \right] \nonumber \\&\quad \ge S_{n}(u)+L_{n}(u), \end{aligned}$$
(A.2)

where \(\epsilon _{ji}=y_{ji}-x_{ji}^{T}\beta _{0}\), \(q<p\), and

$$\begin{aligned} S_{n}(u)&=\sum _{i=1}^{n}\left[ \rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji}) \right] +n\left[ \langle F'_{N}({\overline{\beta }})-F'_{j}({\overline{\beta }}),u/\sqrt{n}\rangle \right] , \end{aligned}$$
(A.3)
$$\begin{aligned} L_{n}(u)&=n\left[ \sum _{k=1}^{q}p'_{\lambda }(|{\overline{\beta }}_{0k}|)\left( |\beta _{0k}+u_{k}/\sqrt{n}|-|\beta _{0k}| \right) \right] . \end{aligned}$$
(A.4)

For any fixed u, we have

$$\begin{aligned} S_{n}(u)&=\sum _{i=1}^{n}\left[ \rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji}) \right] +n\left[ \langle F'_{N}({\overline{\beta }})- F'_{j}({\overline{\beta }}),u/\sqrt{n}\rangle \right] \nonumber \\&=S1+S2. \end{aligned}$$
(A.5)

Since \(\rho (u)=u^{2}\), we have \(\rho '(u)=2u\) and hence \(\rho '(\epsilon _{ji})=2\epsilon _{ji}\); by Condition (A1), \(\text {E}[\rho '(\epsilon _{ji})]=2\text {E}(\epsilon _{ji})=0\). Denote \(Q_{ji}(t)=\text {E}[\rho (\epsilon _{ji}-t)-\rho (\epsilon _{ji})]\); a direct calculation then gives

$$\begin{aligned} Q_{ji}(t)=\text {E}\left[ (\epsilon _{ji}-t)^{2}-\epsilon _{ji}^{2} \right] =t^{2}-2t\,\text {E}(\epsilon _{ji})=t^{2}. \end{aligned}$$
(A.6)
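Because \(\text {E}(\epsilon _{ji})=0\) under Condition (A1), the quantity in (A.6) equals \(t^{2}\) exactly for the squared loss. A quick Monte Carlo check of this identity (our own illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(0.0, 1.0, size=200_000)   # mean-zero errors, as in (A1)
t = 0.7

# Q(t) = E[(eps - t)^2 - eps^2] = t^2 - 2 t E(eps) = t^2
q_hat = np.mean((eps - t) ** 2 - eps ** 2)
assert abs(q_hat - t ** 2) < 0.02          # sample average close to t^2 = 0.49
```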

According to Condition (A2), we have \(\sum _{i=1}^{n}x_{ji}x_{ji}^{T}/{n}\longrightarrow I\). Further, we have

$$\begin{aligned}&\text {E}\left[ \sum _{i=1}^{n}\rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji}) \right] \nonumber \\&\quad =\sum _{i=1}^{n}Q_{ji}(x_{ji}^{T}u/\sqrt{n}) \nonumber \\&\quad =\sum _{i=1}^{n}\left( \frac{x_{ji}^{T}u}{\sqrt{n}} \right) ^{T}\left( \frac{x_{ji}^{T}u}{\sqrt{n}} \right) +o(1) \nonumber \\&\quad =u^{T}\frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n}u+o(1), \end{aligned}$$
(A.7)

where u is fixed.

Denote \(R_{ji,n}=\rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji})+2\epsilon _{ji} x_{ji}^{T}u/\sqrt{n}\); then, by Taylor expansion, there exists \(g_{ji}\) between \(\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n}\) and \(\epsilon _{ji}\) such that

$$\begin{aligned} |R_{ji,n}|=\left| \frac{\rho ''(g_{ji})}{2}\right| u^{T}\frac{x_{ji}x_{ji}^{T}}{n}u =u^{T}\frac{x_{ji}x_{ji}^{T}}{n}u, \end{aligned}$$
(A.8)

since \(\rho ''(g_{ji})=2\). Then we have

$$\begin{aligned} \sum _{i=1}^{n}\text {E}R_{ji,n}^{2}\le \sum _{i=1}^{n}\left[ u^{T}\frac{x_{ji}x_{ji}^{T}}{n}u \times \max \limits _{1\le i\le n}\left( \frac{\Vert x_{ji}\Vert }{\sqrt{n}} \right) ^{2}\times \Vert u\Vert ^{2} \right] . \end{aligned}$$
(A.9)

Under the Condition (A2), we have \(\sum _{i=1}^{n} u^{T}\left[ (x_{ji}x_{ji}^{T})/{n} \right] u\longrightarrow u^{T}Iu\). Then, by the fact that \(\max \limits _{1\le i\le n}(\Vert x_{ji}\Vert /{\sqrt{n}})\longrightarrow 0\) and \(\Vert u\Vert _{2}=M\), we have

$$\begin{aligned} \sum _{i=1}^{n}\text {E}R_{ji,n}^{2}\longrightarrow 0. \end{aligned}$$
(A.10)

Based on the above results, we have

$$\begin{aligned} \begin{aligned} S1&=\text {E}\left[ \sum _{i=1}^{n}\left[ \rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji}) \right] \right] +\sum _{i=1}^{n}\left[ \rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji}) \right] \\&\quad -\text {E}\left[ \sum _{i=1}^{n}\left[ \rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji}) \right] \right] \\&=u^{T}\frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n}u-\left[ \frac{2}{\sqrt{n}}\sum _{i=1}^{n}\epsilon _{ji}x_{ji} \right] ^{T}u+\sum _{i=1}^{n}(R_{ji,n}-\text {E}R_{ji,n})+o_{p}(1). \end{aligned} \end{aligned}$$
(A.11)

Combining \(\text {E}\left[ \sum _{i=1}^{n}(R_{ji,n}-\text {E}R_{ji,n}) \right] ^{2}\le \sum _{i=1}^{n}\text {E}R_{ji,n}^{2}\) with (A.10), we obtain that \(\sum _{i=1}^{n}(R_{ji,n}-\text {E}R_{ji,n})\longrightarrow 0\) in probability. Thus, we can get

$$\begin{aligned} S1=u^{T}\frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n}u -\left[ \frac{2}{\sqrt{n}}\sum _{i=1}^{n}\epsilon _{ji}x_{ji} \right] ^{T}u+o_{p}(1). \end{aligned}$$
(A.12)

For S2, we have

$$\begin{aligned} F'_{N}({\overline{\beta }})-F'_{j}({\overline{\beta }})=\frac{2}{n}\sum _{i=1}^{n}\left[ {\overline{\epsilon }}_{ji}x_{ji} -\frac{1}{m}\sum _{j=1}^{m}{\overline{\epsilon }}_{ji}x_{ji} \right] , \end{aligned}$$
(A.13)

where \({\overline{\epsilon }}_{ji}=y_{ji}-x_{ji}^{T}{\overline{\beta }}\). Further, we have

$$\begin{aligned} S2=n\left[ \langle F'_{N}({\overline{\beta }})-F'_{j}({\overline{\beta }}),u/\sqrt{n}\rangle \right] =\frac{2}{\sqrt{n}}\sum _{i=1}^{n}\left[ {\overline{\epsilon }}_{ji}x_{ji} -\frac{1}{m}\sum _{j=1}^{m}{\overline{\epsilon }}_{ji}x_{ji} \right] ^{T}u. \end{aligned}$$
(A.14)

Now we can simplify \(S_{n}(u)\) to

$$\begin{aligned} \begin{aligned} S_{n}(u)&=S1+S2\\&=u^{T}\frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n}u -\left[ \frac{2}{\sqrt{n}}\sum _{i=1}^{n}\epsilon _{ji}x_{ji} \right] ^{T}u+o_{p}(1)\\&\quad + \frac{2}{\sqrt{n}}\sum _{i=1}^{n}\left[ {\overline{\epsilon }}_{ji}x_{ji} -\frac{1}{m}\sum _{j=1}^{m}{\overline{\epsilon }}_{ji}x_{ji} \right] ^{T}u\\&=u^{T}\frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n}u+\left( \frac{1}{\sqrt{n}} \sum _{i=1}^{n}D_{i} \right) ^{T}u+o_{p}(1)\\&=u^{T}\frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n}u+W_{n}^{T}u+o_{p}(1), \end{aligned} \end{aligned}$$
(A.15)

where \(\xi _{i}=2{\overline{\epsilon }}_{ji}x_{ji}, \eta _{i}=2\sum _{j=1}^{m}{\overline{\epsilon }}_{ji}x_{ji}, \zeta _{i}=2\epsilon _{ji}x_{ji}, D_{i}=\xi _{i}-{\eta _{i}}/{m}-\zeta _{i},i=1,2,\ldots ,n, W_{n}=({1}/{\sqrt{n}})\sum _{i=1}^{n}D_{i}\).

We now deal with \(W_{n}\). According to Condition (A1), we can get

$$\begin{aligned} \begin{aligned}&\text {Cov}(\xi _{i},\xi _{i})=ax_{ji}x_{ji}^{T}, \quad \text {Cov}(\eta _{i},\eta _{i})=a\sum _{j=1}^{m}x_{ji}x_{ji}^{T}, \quad \text {Cov}(\zeta _{i},\zeta _{i})=4\sigma ^{2}x_{ji}x_{ji}^{T}, \quad \\&\text {Cov}(\xi _{i},\eta _{i})=ax_{ji}x_{ji}^{T}, \quad \text {Cov}(\xi _{i},\zeta _{i})=bx_{ji}x_{ji}^{T}, \quad \text {Cov}(\eta _{i},\zeta _{i})=bx_{ji}x_{ji}^{T}, \quad \end{aligned} \end{aligned}$$

where \(a=4\text {Var}({\overline{\epsilon }}_{ji})\) and \(b=4\text {Cov}({\overline{\epsilon }}_{ji},\epsilon _{ji})\). Since the \(D_{i}\) are independent and identically distributed zero-mean random vectors, we can write

$$\begin{aligned} D_{i}=\left[ I_{p\times p},-\frac{I_{p\times p}}{m},-I_{p\times p} \right] _{p\times 3p} \begin{bmatrix} \xi _{i}\\ \eta _{i}\\ \zeta _{i}\\ \end{bmatrix}_{3p\times 1}&\quad i=1,2,\cdots ,n. \end{aligned}$$
(A.16)

Under Condition (A2), we have

$$\begin{aligned} \text {Var}(W_{n})=\frac{1}{n}\sum _{i=1}^{n}\left[ I_{p\times p},-\frac{I_{p\times p}}{m},-I_{p\times p} \right] \text {Var}\begin{bmatrix} \xi _{i}\\ \eta _{i}\\ \zeta _{i}\\ \end{bmatrix} \begin{bmatrix} I_{p\times p}\\ -\frac{I_{p\times p}}{m}\\ -I_{p\times p}\\ \end{bmatrix} \end{aligned}$$
(A.17)

By routine calculations, we can get

$$\begin{aligned} \begin{aligned} \text {Var}(W_{n})&=\frac{1}{n}\sum _{i=1}^{n}\left[ I_{p\times p},-\frac{I_{p\times p}}{m},-I_{p\times p} \right] \\&\quad \times \begin{bmatrix} ax_{ji}x_{ji}^{T} &{} ax_{ji}x_{ji}^{T} &{} bx_{ji}x_{ji}^{T}\\ ax_{ji}x_{ji}^{T} &{} a\sum _{j=1}^{m}x_{ji}x_{ji}^{T} &{} bx_{ji}x_{ji}^{T}\\ bx_{ji}x_{ji}^{T} &{} bx_{ji}x_{ji}^{T} &{} 4\sigma ^{2}x_{ji}x_{ji}^{T}\\ \end{bmatrix} \begin{bmatrix} I_{p\times p}\\ -\frac{I_{p\times p}}{m}\\ -I_{p\times p}\\ \end{bmatrix}\\&=\frac{1}{n}\sum _{i=1}^{n}\left[ \frac{m-2}{m}ax_{ji}x_{ji}^{T} +\frac{2-2m}{m}bx_{ji}x_{ji}^{T}+4\sigma ^{2}x_{ji}x_{ji}^{T} +\frac{a}{m^{2}}\sum _{j=1}^{m}x_{ji}x_{ji}^{T} \right] \\&\quad \xrightarrow {\ \text {P}\ }m^{-1}cI, \end{aligned} \end{aligned}$$
(A.18)

as \(n\longrightarrow \infty \), where \(c=(m-1)a+(2-2m)b+4m\sigma ^{2}\). By the central limit theorem, we can obtain that

$$\begin{aligned} W_{n}=\frac{1}{\sqrt{n}}\sum _{i=1}^{n}D_{i}\xrightarrow {\ d\ }N(0,m^{-1}cI). \end{aligned}$$
(A.19)
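The block-matrix algebra leading to (A.18) can be verified numerically. The sketch below (our own check, with arbitrary stand-in values for a, b, \(\sigma ^{2}\), m and, for simplicity, the same covariate vector x on every worker, so that \(\sum _{j=1}^{m}x x^{T}=m\,x x^{T}\)) confirms that the sandwich product collapses to \((c/m)\,x x^{T}\):

```python
import numpy as np

p, m = 3, 5
a, b, sigma2 = 1.3, 0.4, 0.8          # arbitrary stand-ins for a, b, sigma^2
rng = np.random.default_rng(1)
x = rng.normal(size=(p, 1))
X = x @ x.T                            # the rank-one block x x^T
I = np.eye(p)

# covariance matrix of the stacked vector (xi_i, eta_i, zeta_i)
M = np.block([
    [a * X,     a * X,     b * X],
    [a * X, a * m * X,     b * X],     # sum_j x x^T = m x x^T here
    [b * X,     b * X, 4 * sigma2 * X],
])
S = np.hstack([I, -I / m, -I])         # the [I, -I/m, -I] selector

c = (m - 1) * a + (2 - 2 * m) * b + 4 * m * sigma2
assert np.allclose(S @ M @ S.T, (c / m) * X)   # matches (A.18)'s limit
```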

It follows that \(W_{n}^{T}u\) is bounded in probability, i.e.,

$$\begin{aligned} W_{n}^{T}u=O_{p}(\sqrt{m^{-1}cu^{T}Iu}). \end{aligned}$$
(A.20)

By applying Lemma 1 to \(Z_{n}(u)=S_{n}(u)-W_{n}^{T}u\), we can strengthen this pointwise convergence to uniform convergence on any compact subset of \({\mathbb {R}}^{p}\). We now analyze \(L_{n}(u)\). For the SCAD penalty, \(p'_{\lambda }(u)=\lambda I(|u|\le \lambda )+(\max (0,a\lambda -|u|))/(a-1)I(|u|> \lambda )\). If \(u\ge a\lambda \) (with \(a=3.7\)), then \(p'_{\lambda }(u)=0\). If \(|u|<\lambda \) and \(\lambda \longrightarrow 0\), then \(p'_{\lambda }(u)=\lambda \longrightarrow 0\).
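The case analysis of \(p'_{\lambda }\) above can be made concrete with a small helper (our own sketch of the SCAD derivative from Fan and Li (2001), for nonnegative arguments):

```python
def scad_deriv(theta, lam, a=3.7):
    """Derivative p'_lambda(theta) of the SCAD penalty."""
    theta = abs(theta)
    if theta <= lam:
        return lam                                 # lasso-like inner region
    return max(0.0, a * lam - theta) / (a - 1)     # tapers to 0 past a*lam

# p'_lambda vanishes once |u| >= a*lambda: large coefficients are unpenalized
assert scad_deriv(5.0, 1.0) == 0.0
# for |u| <= lambda the derivative equals lambda itself
assert scad_deriv(0.5, 1.0) == 1.0
```

The two assertions mirror the two cases used in the proof: the penalty exerts no shrinkage on coefficients beyond \(a\lambda \), while near zero its slope is exactly \(\lambda \), which vanishes as \(\lambda \longrightarrow 0\).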

According to the results above, we can obtain that:

$$\begin{aligned} L_{n}(u)=n\left[ \sum _{k=1}^{q}p'_{\lambda }(|{\overline{\beta }}_{0k}|)\left( |\beta _{0k}+u_{k}/\sqrt{n}|-|\beta _{0k}| \right) \right] =0. \end{aligned}$$
(A.21)

By Condition (A2) and Eqs. (A.2), (A.15), (A.20) and (A.21), \(n\left[ {\widetilde{F}}_{SCAD}(\beta _{0}+u/\sqrt{n})-{\widetilde{F}}_{SCAD}(\beta _{0}) \right] \) is dominated by the term \(u^{T}\left[ \sum _{i=1}^{n}x_{ji}x_{ji}^{T}/n \right] u\) when \(\Vert u\Vert _{2}=M\) is large enough, which yields Eq. (A.1). This in turn implies that there exists a local minimizer such that \(\Vert {\widehat{\beta }}^{SCAD}-\beta _{0}\Vert _{2}=O_{p}(n^{-1/2})\) as \(n\longrightarrow \infty \). Since \(N=mn\), if \(\lambda =\lambda (N)\longrightarrow 0\), then \({\widehat{\beta }}^{SCAD}\) converges to \(\beta _{0}\) in probability, with \(\Vert {\widehat{\beta }}^{SCAD}-\beta _{0}\Vert _{2}=O_{p}(N^{-1/2})\) as \(N\longrightarrow \infty \). This completes the proof. \(\square \)

Lemma 3

Suppose that the sample set \(\left\{ x_{i},y_{i}\right\} _{i=1}^{N}\) is generated according to process (3.1). Under Conditions (A1) and (A2), if \(\lambda =\lambda (n)\longrightarrow 0\) and \(\sqrt{n}\lambda \longrightarrow \infty \) as \(n\longrightarrow \infty \), then with probability tending to one, for any given \(\beta _{1}\) satisfying \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\) and any constant M, we obtain

$$\begin{aligned} (\beta _{1}^{T},0^{T})^{T} =\arg \min \limits _{\Vert \beta _{2}\Vert \le Mn^{-\frac{1}{2}}}{\widetilde{F}}_{SCAD}((\beta _{1}^{T},\beta _{2}^{T})^{T}). \end{aligned}$$

i.e., for any \(\delta >0\), we have

$$\begin{aligned} \text {P}\left[ \inf \limits _{\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}}{\widetilde{F}}_{SCAD}((\beta _{1}^{T},\beta _{2}^{T})^{T}) >{\widetilde{F}}_{SCAD}((\beta _{1}^{T},0^{T})^{T}) \right] \ge 1-\delta . \end{aligned}$$

Proof of Lemma 3

For any \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\), \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\), based on the proof of Theorem 1, we can obtain that

$$\begin{aligned}&n\left[ {\widetilde{F}}_{SCAD}((\beta _{1}^{T},0^{T})^{T}) -{\widetilde{F}}_{SCAD}((\beta _{1}^{T},\beta _{2}^{T})^{T}) \right] \nonumber \\&\quad =n\left[ {\widetilde{F}}_{SCAD}((\beta _{1}^{T},0^{T})^{T}) -{\widetilde{F}}_{SCAD}((\beta _{10}^{T},0^{T})^{T}) \right] \nonumber \\&\qquad -n\left[ {\widetilde{F}}_{SCAD}((\beta _{1}^{T},\beta _{2}^{T})^{T}) -{\widetilde{F}}_{SCAD}((\beta _{10}^{T},0^{T})^{T}) \right] \nonumber \\&\quad =S_{n}(\sqrt{n}((\beta _{1}-\beta _{10})^{T},0^{T})^{T}) -S_{n}(\sqrt{n}((\beta _{1}-\beta _{10})^{T},\beta _{2}^{T})^{T})\nonumber \\&\qquad - n\sum _{l=q+1}^{p}p'_{\lambda }(|{\overline{\beta }}_{l}|)|\beta _{l}| \nonumber \\&\quad =\sqrt{n}((\beta _{1}-\beta _{10})^{T},0^{T})\left[ \frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n} \right] \sqrt{n}((\beta _{1}-\beta _{10})^{T},0^{T})^{T}\nonumber \\&\qquad +W_{n}^{T}\sqrt{n}((\beta _{1}-\beta _{10})^{T},0^{T})^{T} \nonumber \\&\qquad -\sqrt{n}((\beta _{1}-\beta _{10})^{T},\beta _{2}^{T})\left[ \frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n} \right] \sqrt{n}((\beta _{1}-\beta _{10})^{T},\beta _{2}^{T})^{T}\nonumber \\&\qquad -W_{n}^{T}\sqrt{n}((\beta _{1}-\beta _{10})^{T},\beta _{2}^{T})^{T} \nonumber \\&\qquad -n\sum _{l=q+1}^{p}p'_{\lambda }(|{\overline{\beta }}_{l}|)|\beta _{l}|+o_{p}(1). \end{aligned}$$
(A.22)

According to \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\), \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\) and (A.20), we can get that

$$\begin{aligned} \begin{aligned}&\sqrt{n}((\beta _{1}-\beta _{10})^{T},0^{T})\left[ \frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n} \right] \sqrt{n}((\beta _{1}-\beta _{10})^{T},0^{T})^{T}=O_{p}(1), \\&\sqrt{n}((\beta _{1}-\beta _{10})^{T},\beta _{2}^{T})\left[ \frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n} \right] \sqrt{n}((\beta _{1}-\beta _{10})^{T},\beta _{2}^{T})^{T}=O_{p}(1), \\ \end{aligned} \end{aligned}$$

and

$$\begin{aligned} \begin{aligned}&W_{n}^{T}\sqrt{n}((\beta _{1}-\beta _{10})^{T},0^{T})^{T} -W_{n}^{T}\sqrt{n}((\beta _{1}-\beta _{10})^{T},\beta _{2}^{T})^{T} =-\sqrt{n}W_{n}^{T} (0^{T},\beta _{2}^{T})^{T} \\&\quad =O_{p}(\sqrt{nm^{-1}c\beta _{2}^{T}I_{22}\beta _{2}})=O_{p}(1), \end{aligned} \end{aligned}$$

where \(I_{22}\) is the bottom-right \((p-q)\)-by-\((p-q)\) submatrix of I.

Under the conditions \(\lambda =\lambda (n)\longrightarrow 0\) and \(\sqrt{n}\lambda \longrightarrow \infty \) as \(n\longrightarrow \infty \) and the fact that \(\varliminf \limits _{\lambda \longrightarrow 0}\varliminf \limits _{\theta \longrightarrow 0^{+}} p'_{\lambda }(\theta )/\lambda =1\), we have

$$\begin{aligned} \begin{aligned} n\sum _{l=q+1}^{p}p'_{\lambda }(|{\overline{\beta }}_{l}|)|\beta _{l}|&=n\lambda \sum _{l=q+1}^{p}\frac{p'_{\lambda }(|{\overline{\beta }}_{l}|)}{\lambda }|\beta _{l}|\\&\ge n\lambda \sum _{l=q+1}^{p}\varliminf \limits _{\lambda \longrightarrow 0}\varliminf \limits _{\theta \longrightarrow 0^{+}}\frac{p'_{\lambda }(\theta )}{\lambda }|\beta _{l}|(1+o(1))\\&=n\lambda \sum _{l=q+1}^{p}|\beta _{l}|(1+o(1)). \end{aligned} \end{aligned}$$

Since \(\sqrt{n}\lambda \longrightarrow \infty \), we have \(n\lambda \longrightarrow \infty \), which implies that the term \(-n\sum _{l=q+1}^{p}p'_{\lambda }(|{\overline{\beta }}_{l}|)|\beta _{l}|\) in (A.22) dominates in magnitude. That is, \({\widetilde{F}}_{SCAD}((\beta _{1}^{T},0^{T})^{T}) -{\widetilde{F}}_{SCAD}((\beta _{1}^{T},\beta _{2}^{T})^{T})<0\) for large n. Then

$$\begin{aligned} \inf \limits _{\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}}{\widetilde{F}}_{SCAD}((\beta _{1}^{T},\beta _{2}^{T})^{T}) >{\widetilde{F}}_{SCAD}\left( (\beta _{1}^{T},0^{T})^{T} \right) \end{aligned}$$

with probability tending to one on any compact subset of \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\) and \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\). This completes the proof. \(\square \)

Proof of Theorem 2

(I) As discussed in Fan and Li (2001) and by Lemma 3, with \(N=mn\), it is obvious that when \(\lambda =\lambda (N)\longrightarrow 0\) and \(\sqrt{N}\lambda \longrightarrow \infty \) as \(N\longrightarrow \infty \), we can obtain that \({\widehat{\beta }}_{2}^{SCAD}=0\). (II) According to Theorem 1, there exists a \(\sqrt{N}\)-consistent minimizer \({\widehat{\beta }}_{1}^{SCAD}\) of \({\widetilde{F}}_{SCAD}\left( (\beta _{1}^{T},0^{T})^{T} \right) \), viewed as a function of \(\beta _{1}\).

The proof of Theorem 1 implies that \(\sqrt{n}({\widehat{\beta }}_{1}^{SCAD}-\beta _{10})\) minimizes

$$\begin{aligned} S_{n}((\theta ^{T},0^{T})^{T}) +n\sum _{k=1}^{q}p'_{\lambda }(|{\overline{\beta }}_{0k}|) \left[ |\beta _{0k}+\theta _{k}/\sqrt{n}|-|\beta _{0k}| \right] , \end{aligned}$$
(A.23)

with respect to \(\theta \), where \(\theta =(\theta _{1},\ldots ,\theta _{q})^{T}\in {\mathbb {R}}^{q}\). By Lemma 1 and (A.15), we can obtain that

$$\begin{aligned} S_{n}((\theta ^{T},0^{T})^{T})= & {} (\theta ^{T},0^{T})\left[ \frac{\sum _{i=1}^{n}x_{ji}x_{ji}^{T}}{n} \right] (\theta ^{T},0^{T})^{T}+W_{n}^{T}(\theta ^{T},0^{T})^{T}+o_{p}(1) \nonumber \\= & {} \theta ^{T}\left[ \frac{\sum _{i=1}^{n}x_{ji}^{1}(x_{ji}^{1})^{T}}{n} \right] \theta +(W_{n}^{1})^{T}\theta +o_{p}(1), \end{aligned}$$
(A.24)

uniformly on any compact subset of \({\mathbb {R}}^{q}\), where \(x_{ji}^{1}\) denotes the subvector of the first q components of \(x_{ji}\), \(\xi _{i}^{1}=2{\overline{\epsilon }}_{ji}x_{ji}^{1}\), \(\eta _{i}^{1}=2\sum _{j=1}^{m}{\overline{\epsilon }}_{ji}x_{ji}^{1}\), \(\zeta _{i}^{1}=2\epsilon _{ji}x_{ji}^{1}\), \(D_{i}^{1}=\xi _{i}^{1}-\eta _{i}^{1}/m-\zeta _{i}^{1}\), \(i=1,2,\ldots ,n\), and \(W_{n}^{1}=\sum _{i=1}^{n}D_{i}^{1}/\sqrt{n}\). Combining these definitions with (A.19), we can get

$$\begin{aligned} W_{n}^{1}\xrightarrow {\ d\ }N(0,m^{-1}cI_{11}). \end{aligned}$$
(A.25)

where \(I_{11}\) is the top-left q-by-q submatrix of I.

We define

$$\begin{aligned} \begin{aligned} T_{1}&=n\sum _{k=1}^{q}p'_{\lambda }(|{\overline{\beta }}_{0k}|) \left[ |\beta _{0k}+\theta _{k}/\sqrt{n}|-|\beta _{0k}| \right] \\&=\sum _{k=1}^{q}\sqrt{n}p'_{\lambda }(|{\overline{\beta }}_{0k}|) \frac{|\beta _{0k}+\theta _{k}/\sqrt{n}|-|\beta _{0k}|}{1/\sqrt{n}} \\&=\sum _{k=1}^{q}T_{1k}. \end{aligned} \end{aligned}$$

Note that

$$\begin{aligned} \frac{|\beta _{0k}+\theta _{k}/\sqrt{n}|-|\beta _{0k}|}{1/\sqrt{n}}\longrightarrow \text {Sign}(\beta _{0k})\theta _{k}I(\beta _{0k}\ne 0)+|\theta _{k}|I(\beta _{0k}=0). \end{aligned}$$

For the SCAD penalty, \(p'_{\lambda }(\theta )=\lambda I(|\theta |\le \lambda )+(\max (0,a\lambda -|\theta |))/(a-1)I(|\theta |> \lambda )\). If \(\theta \ge a\lambda \) (with \(a=3.7\)), then \(p'_{\lambda }(\theta )=0\); if \(|\theta |<\lambda \) and \(\lambda \longrightarrow 0\), then \(p'_{\lambda }(\theta )=\lambda \longrightarrow 0\). For \(\beta _{0k}\ne 0\), since \(|{\overline{\beta }}_{0k}|\xrightarrow {\ \text {P}\ }|\beta _{0k}|>0\), the condition \(\lambda \longrightarrow 0\) ensures \(T_{1k}\longrightarrow \sqrt{n}p'_{\lambda }(|{\overline{\beta }}_{0k}|) \text {Sign}(\beta _{0k})\theta _{k}\xrightarrow {\ \text {P}\ }0\). For \(\beta _{0k}=0\), \(T_{1k}=0\) if \(\theta _{k}=0\); when \(\theta _{k}\ne 0\), since \(|{\overline{\beta }}_{0k}|=O_{p}(n^{-1/2})\) and \(p'_{\lambda }(\theta )=\lambda \) for \(|\theta |<\lambda \), it follows that if \(\sqrt{n}\lambda \longrightarrow \infty \), then \(T_{1k}=\sqrt{n}p'_{\lambda }(|{\overline{\beta }}_{0k}|)|\theta _{k}|=|\theta _{k}|\sqrt{n}\lambda \) with probability tending to one, and thus \(T_{1k}\xrightarrow {\ \text {P}\ }\infty \). Let us write \(\theta ^{*}=(\theta _{10}^{T},\theta _{20}^{T})^{T}\); then we have

$$\begin{aligned} \begin{aligned}&T_{1}\xrightarrow {\ \text {P}\ }\left\{ \begin{array}{ll} 0,&{}\quad \theta _{20}=0, \\ \infty ,&{}\quad \text {otherwise}. \end{array} \right. \end{aligned} \end{aligned}$$

In Theorem 2, \(\theta ^{*}=(\theta ^{T},0^{T})^{T}\), so we have \(T_{1}\xrightarrow {\ \text {P}\ }0\). We can further obtain that

$$\begin{aligned} n\sum _{k=1}^{q}p'_{\lambda }(|{\overline{\beta }}_{0k}|) \left[ |\beta _{0k}+\theta _{k}/\sqrt{n}|-|\beta _{0k}| \right] \xrightarrow {\ \text {P}\ }0. \end{aligned}$$
(A.26)

Based on (A.24) and (A.26) and \(\lim \limits _{n\longrightarrow \infty }\sum _{i=1}^{n}x_{ji}^{1}(x_{ji}^{1})^{T}/n=I_{11}\), where \(I_{11}\) is the q-by-q submatrix of I, we have

$$\begin{aligned}&S_{n}((\theta ^{T},0^{T})^{T}) +n\sum _{k=1}^{q}p'_{\lambda }(|{\overline{\beta }}_{0k}|) \left[ |\beta _{0k}+\theta _{k}/\sqrt{n}|-|\beta _{0k}| \right] \nonumber \\&\quad =\theta ^{T}I_{11}\theta +(W_{n}^{1})^{T}\theta +o_{p}(1). \end{aligned}$$
(A.27)

According to (A.25) and Lemma 2, we can obtain that

$$\begin{aligned} \sqrt{n}({\widehat{\beta }}_{1}^{SCAD}-\beta _{10})\xrightarrow {\ d\ }N(0,\frac{c}{4m}I_{11}^{-1}). \end{aligned}$$
(A.28)

Due to \(N=mn\), we can get that

$$\begin{aligned} \sqrt{N}({\widehat{\beta }}_{1}^{SCAD}-\beta _{10})\xrightarrow {\ d\ }N(0,\frac{c}{4}I_{11}^{-1}). \end{aligned}$$
(A.29)

This completes the proof. \(\square \)

Proof of Theorem 3

(I) For any \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\), \(\Vert \beta _{2}\Vert \le Mn^{-1/2}\), according to (A.22), we can get that

$$\begin{aligned}&n\left[ {\widetilde{F}}_{AL}((\beta _{1}^{T},0^{T})^{T}) -{\widetilde{F}}_{AL}((\beta _{1}^{T},\beta _{2}^{T})^{T}) \right] \nonumber \\&\quad =S_{n}(\sqrt{n}((\beta _{1}-\beta _{10})^{T},0^{T})^{T}) -S_{n}(\sqrt{n}((\beta _{1}-\beta _{10})^{T},\beta _{2}^{T})^{T})- n\lambda \sum _{k=q+1}^{p}{\overline{w}}_{k}|\beta _{k}|. \end{aligned}$$
(A.30)

Note that the first two terms of (A.30) are exactly the same as in (A.22) and hence can be bounded in the same way. The third term, however, due to \(n^{(1+r)/2}\lambda \longrightarrow \infty \) and \(\sqrt{n}{\widehat{\beta }}_{k}=O_{p}(1)\), satisfies

$$\begin{aligned} -n\lambda \sum _{k=q+1}^{p}{\overline{w}}_{k}|\beta _{k}| =-[n^{(1+r)/2}\lambda ]\sqrt{n}\left[ \sum _{k=q+1}^{p}\left| \left( \sqrt{n}|{\widehat{\beta }}_{k} | \right) ^{-r}\right| |\beta _{k}| \right] \longrightarrow -\infty . \end{aligned}$$

These facts in turn imply that, as \(n\longrightarrow \infty \),

$$\begin{aligned} n\left[ {\widetilde{F}}_{AL}((\beta _{1}^{T},0^{T})^{T}) -{\widetilde{F}}_{AL}((\beta _{1}^{T},\beta _{2}^{T})^{T}) \right] <0. \end{aligned}$$

Then we have

$$\begin{aligned} \text {P}\left[ \inf \limits _{\Vert \beta _{2}\Vert _{2}=Mn^{-1/2}}{\widetilde{F}}_{AL}((\beta _{1}^{T},\beta _{2}^{T})^{T})>{\widetilde{F}}_{AL}((\beta _{1}^{T},0^{T})^{T}) \right] \ge 1-\delta . \end{aligned}$$

Since \(N=mn\), if \(\lambda =\lambda (N)\) satisfies \(\sqrt{N}\lambda \longrightarrow 0\) and \(N^{(1+r)/2}\lambda \longrightarrow \infty \) as \(N\longrightarrow \infty \), we can obtain that \({\widehat{\beta }}_{2}^{AL}=0\) with probability tending to one.
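The two rate conditions on \(\lambda \) are easy to satisfy simultaneously. A minimal numerical sketch (the choices \(r=1\) and \(\lambda =N^{-3/4}\) are our own illustration, not from the proof):

```python
# Rate conditions for the adaptive LASSO tuning parameter lambda = lambda(N):
#   sqrt(N) * lam       -> 0         (no asymptotic bias on nonzero coefficients)
#   N**((1+r)/2) * lam  -> infinity  (zero coefficients are set exactly to 0)
# Illustrative choice: r = 1 and lam = N**(-3/4), which meets both conditions.
r = 1
for N in (10**2, 10**4, 10**6):
    lam = N ** (-3 / 4)
    print(N, N ** 0.5 * lam, N ** ((1 + r) / 2) * lam)
# The second column equals N**(-1/4) and shrinks toward 0, while the third
# equals N**(1/4) and diverges, as the proof requires.
```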

(II) According to (A.2),

$$\begin{aligned}&n\left[ {\widetilde{F}}_{AL}(\beta _{0}+u/\sqrt{n})-{\widetilde{F}}_{AL}(\beta _{0}) \right] \nonumber \\&\quad =\sum _{i=1}^{n}\left[ \rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji}) \right] +n\left[ \langle F'_{N}({\overline{\beta }})-F'_{j}({\overline{\beta }}),u/\sqrt{n}\rangle \right] \nonumber \\&\qquad +n\lambda \sum _{k=1}^{p}\left[ {\overline{w}}_{k}|\beta _{0k}+u_{k}/\sqrt{n}|-{\overline{w}}_{k}|\beta _{0k}| \right] . \end{aligned}$$
(A.31)

We analyze the third term. For \(k=1,2,\cdots ,q\), we have \(\beta _{0k}\ne 0\) and \({\overline{w}}_{k}\xrightarrow {\ \text {P}\ }|\beta _{0k}|^{-r}\); by routine calculation, for \(n\) sufficiently large,

$$\begin{aligned} \begin{aligned}&n\lambda \sum _{k=1}^{q}\left[ {\overline{w}}_{k}|\beta _{0k}+u_{k}/\sqrt{n}|-{\overline{w}}_{k}|\beta _{0k}| \right] \\&\quad =\sqrt{n}\lambda \sum _{k=1}^{q}{\overline{w}}_{k}\frac{|\beta _{0k}+u_{k}/\sqrt{n}|-|\beta _{0k}|}{1/\sqrt{n}} \\&\quad =\sqrt{n}\lambda \sum _{k=1}^{q}{\overline{w}}_{k}u_{k}\text {Sign}(\beta _{0k}), \end{aligned} \end{aligned}$$

and by \(\sqrt{n}\lambda \longrightarrow 0\), we can obtain that

$$\begin{aligned} n\lambda \sum _{k=1}^{q}\left[ {\overline{w}}_{k}|\beta _{0k}+u_{k}/\sqrt{n}| -{\overline{w}}_{k}|\beta _{0k}| \right] \xrightarrow {\ \text {P}\ }0. \end{aligned}$$
(A.32)

For \(k=q+1,q+2,\cdots ,p\) and the true coefficient \(\beta _{0k}=0\), we have

$$\begin{aligned} n\lambda \sum _{k=q+1}^{p}\left[ {\overline{w}}_{k}|\beta _{0k}+u_{k}/\sqrt{n}| -{\overline{w}}_{k}|\beta _{0k}| \right] =\sqrt{n}\lambda \sum _{k=q+1}^{p}[{\overline{w}}_{k}|u_{k}|]. \end{aligned}$$

When \(u_{k}\ne 0\), since \(\sqrt{n}\lambda {\overline{w}}_{k}=n^{(1+r)/2}\lambda (\sqrt{n}|{\widehat{\beta }}_{k}|)^{-r}\), \(n^{(1+r)/2}\lambda \longrightarrow \infty \), and \(\sqrt{n}|{\widehat{\beta }}_{k}|=O_{p}(1)\), we can get that

$$\begin{aligned} n\lambda \left[ {\overline{w}}_{k}|\beta _{0k}+u_{k}/\sqrt{n}| -{\overline{w}}_{k}|\beta _{0k}| \right] \longrightarrow \infty . \end{aligned}$$
(A.33)

When \(u_{k}=0\), \(n\lambda \left[ {\overline{w}}_{k}|\beta _{0k}+u_{k}/\sqrt{n}| -{\overline{w}}_{k}|\beta _{0k}| \right] =0\).

We define \(u^{*}=(u_{10}^{T},u_{20}^{T})^{T}\) and \(W_{n}=((W_{n}^{1})^{T},(W_{n}^{2})^{T})^{T}\). Combining (A.31), (A.32), (A.33), Theorem 1 and Condition (A2), we conclude that for each fixed u,

$$\begin{aligned} \begin{aligned}&n\left[ {\widetilde{F}}_{AL}(\beta _{0}+u/\sqrt{n})-{\widetilde{F}}_{AL}(\beta _{0}) \right] \xrightarrow {\ d\ }V(u)\\&\quad = \left\{ \begin{array}{ll} (u^{1})^{T}I_{11}u^{1}+(W_{n}^{1})^{T}u^{1},&\quad \text {if } u_{20}=0, \\ \infty ,&\quad \text {otherwise}, \end{array} \right. \end{aligned} \end{aligned}$$
(A.34)

where \(u^{1}=(u_{1},u_{2},\cdots ,u_{q})^{T}\). Note that \(n\left[ {\widetilde{F}}_{AL}(\beta _{0}+u/\sqrt{n})-{\widetilde{F}}_{AL}(\beta _{0}) \right] \) is convex in \(u\) and \(V(u)\) has a unique minimizer; then we have

$$\begin{aligned} \arg \min \limits _{u\in {\mathbb {R}}^{q}}n\left[ {\widetilde{F}}_{AL}(\beta _{0}+u/\sqrt{n})-{\widetilde{F}}_{AL}(\beta _{0}) \right] =\sqrt{n}({\widehat{\beta }}_{1}^{AL}-\beta _{10}) \xrightarrow {\ d\ }\arg \min \limits _{u\in {\mathbb {R}}^{q}}V(u).\nonumber \\ \end{aligned}$$
(A.35)

According to (A.25), (A.34), (A.35) and Lemma 2, we can obtain that

$$\begin{aligned} \sqrt{n}({\widehat{\beta }}_{1}^{AL}-\beta _{10})\xrightarrow {\ d\ }N(0,\frac{c}{4m}I_{11}^{-1}). \end{aligned}$$
(A.36)

Since \(N=mn\), if \(\sqrt{N}\lambda \longrightarrow 0\) and \(N^{(1+r)/2}\lambda \longrightarrow \infty \) as \(N\longrightarrow \infty \), we can get that

$$\begin{aligned} \sqrt{N}({\widehat{\beta }}_{1}^{AL}-\beta _{10})\xrightarrow {\ d\ }N(0,\frac{c}{4}I_{11}^{-1}). \end{aligned}$$
(A.37)

This completes the proof. \(\square \)


Cite this article

Liu, Z., Zhao, X. & Pan, Y. Communication-efficient distributed estimation for high-dimensional large-scale linear regression. Metrika (2022). https://doi.org/10.1007/s00184-022-00878-x


Keywords

  • Distributed optimization
  • SCAD
  • Adaptive LASSO
  • GEL function
  • Modified proximal ADMM algorithm