Abstract
In the master–worker distributed framework, this paper proposes a regularized gradient-enhanced loss (GEL) function for high-dimensional, large-scale linear regression with SCAD and adaptive LASSO penalties. The contributions of this paper are twofold. (1) Computationally, to exploit the computing power of each machine and accelerate convergence, the proposed distributed estimation method lets all workers optimize their corresponding GEL functions in parallel, after which the master aggregates the results. (2) In terms of communication, the proposed modified proximal alternating direction method of multipliers (ADMM) algorithm matches the centralized method based on the full sample within a few rounds of communication. Under mild assumptions, we establish the oracle properties of the SCAD- and adaptive-LASSO-penalized linear regression. Simulation studies assess the finite-sample properties of the proposed method, and an application to an HIV drug susceptibility study demonstrates its utility in practice.
References
Attouch H (2020) Fast inertial proximal ADMM algorithms for convex structured optimization with linear constraint. https://hal.archives-ouvertes.fr/hal-02501604
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122
Chen X, Liu W, Mao X, Yang Z (2020) Distributed high-dimensional regression under a quantile loss function. J Mach Learn Res 21(182):1–43
Cheng G, Shang Z (2015) Computational limits of divide-and-conquer method. arXiv preprint arXiv:1512.09226
Coffin JM (1995) HIV population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science 267(5197):483–489
Duchi JC, Agarwal A, Wainwright MJ (2012) Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans Autom Control 57(3):592–606
Eckstein J (1994) Some saddle-function splitting methods for convex programming. Optimization Methods and Software 4(1):75–83
Fan J, Guo Y, Wang K (2021) Communication-efficient accurate statistical estimation. J Am Stat Assoc 1–11
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Fan Y, Lin N, Yin X (2021) Penalized quantile regression for distributed big data using the slack variable representation. J Comput Graph Stat 30(3):557–565
Fazel M, Pong TK, Sun D, Tseng P (2013) Hankel matrix rank minimization with applications to system identification and realization. SIAM J Matrix Anal Appl 34(3):946–977
Gao Y, Liu W, Wang H, Wang X, Yan Y, Zhang R (2021) A review of distributed statistical inference. Statistical Theory and Related Fields 1–11
Gu Y, Fan J, Kong L, Ma S, Zou H (2018) ADMM for high-dimensional sparse penalized quantile regression. Technometrics 60(3):319–331
He B, Liao LZ, Han D, Yang H (2002) A new inexact alternating directions method for monotone variational inequalities. Math Program 92(1):103–118
Hjort NL, Pollard D (2011) Asymptotics for minimisers of convex processes. arXiv preprint arXiv:1107.3806
Hu A, Li C, Wu J (2021) Communication-efficient modeling with penalized quantile regression for distributed data. Complexity 2021
Huang C, Huo X (2019) A distributed one-step estimator. Math Program 174(1):41–76
Jordan MI, Lee JD, Yang Y (2019) Communication-efficient distributed statistical inference. J Am Stat Assoc 114(526):668–681
Kannan R, Vempala S, Woodruff D (2014) Principal component analysis and higher correlations for distributed data. In Conference on Learning Theory 35:1040–1057
Lee JD, Sun Y, Liu Q, Taylor JE (2015) Communication-efficient sparse regression: a one-shot approach. arXiv preprint arXiv:1503.04337
Lian H, Liu J, Fan Z (2021) Distributed learning for sketched kernel regression. Neural Netw 143:368–376
Lin SB, Guo X, Zhou DX (2017) Distributed learning with regularized least squares. The Journal of Machine Learning Research 18(1):3202–3232
Lu J, Cheng G, Liu H (2016) Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212
Mcdonald R, Mohri M, Silberman N, Walker D, Mann G (2009) Efficient large-scale distributed training of conditional maximum entropy models. Adv Neural Inf Process Syst 22:1231–1239
Pan Y, Liu Z, Cai W (2020) Large-scale expectile regression with covariates missing at random. IEEE Access 8:36502–36513
Pan Y (2021) Distributed optimization and statistical learning for large-scale penalized expectile regression. Journal of the Korean Statistical Society 50(1):290–314
Pollard D (1991) Asymptotics for least absolute deviation regression estimators. Economet Theor 7(2):186–199
Rhee SY, Gonzales MJ, Kantor R, Betts BJ, Ravela J, Shafer RW (2003) Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res 31(1):298–303
Rosenblatt JD, Nadler B (2016) On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA 5(4):379–404
Shamir O, Srebro N, Zhang T (2014) Communication-efficient distributed optimization using an approximate newton-type method. In International Conference on Machine Learning 32(2):1000–1008
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288
Wang HJ, McKeague IW, Qian M (2018) Testing for marginal linear effects in quantile regression. J R Stat Soc Ser B Stat Methodol 80(2):433–452
Wang J, Kolar M, Srebro N, Zhang T (2017) Efficient distributed learning with sparsity. In International Conference on Machine Learning 70:3636–3645
Wang L, Lian H (2020) Communication-efficient estimation of high-dimensional quantile regression. Anal Appl 18(06):1057–1075
Xu G, Shang Z, Cheng G (2018) Optimal tuning for divide-and-conquer kernel ridge regression with massive data. In International Conference on Machine Learning 80:5483–5491
Zhang Y, Duchi JC, Wainwright MJ (2013) Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research 14(1):3321–3363
Zhang Y, Lin X (2015) DiSCO: Distributed optimization for self-concordant empirical loss. In International conference on machine learning 37:362–370
Zhang CH, Zhang T (2012) A general theory of concave regularization for high-dimensional sparse estimation problems. Stat Sci 27(4):576–593
Zhao W, Zhang F, Lian H (2019) Debiasing and distributed estimation for high-dimensional quantile regression. IEEE Transactions on Neural Networks and Learning Systems 31(7):2569–2577
Zhu X, Li F, Wang H (2021) Least-square approximation for a distributed system. J Comput Graph Stat 30:1004–1018
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Zou H, Li R (2008) One-step sparse estimates in nonconcave penalized likelihood models. Ann Stat 36(4):1509–1533
Acknowledgements
This work is supported by the National Natural Science Foundation of China (NSFC) (No. 11901175), the Science and Technology Research Project of Hubei Education Department (No. Q20211007).
Appendix: Proof of Theorem
This section collects all the lemmas and proofs needed for the main results.
Lemma 1
Let \(\left\{ S_{n}(u):u\in U\right\} \) be a sequence of random convex functions defined on convex, open subset U of \({\mathbb {R}}^{p}\). Suppose \(S(\cdot )\) is a real-valued function on U for \(S_{n}(u)\longrightarrow S(u)\) in probability for each \(u\in U\). Then for each compact subset K of U, \(\sup \limits _{u\in K}|S_{n}(u)-S(u)|\longrightarrow 0\) in probability and the function \(S(\cdot )\) is necessarily convex on U.
The proof of Lemma 1 can be found in Pollard (1991).
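As a quick numerical illustration of Lemma 1 (our own sketch, not part of the paper: the distribution of the \(e_{i}\), the grid, and the function names are all assumptions), consider a sample average of random convex parabolas, whose pointwise limit is the convex function \(S(u)=u^{2}+1\); the supremum error over a compact set shrinks as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def S_n(u, n):
    # Random convex function: average of (u - e_i)^2 with e_i ~ N(0, 1),
    # which converges pointwise in probability to S(u) = u^2 + 1.
    e = rng.standard_normal(n)
    return np.mean((u - e) ** 2)

def sup_err(n, K=2.0, grid_size=41):
    # Approximate sup over the compact set [-K, K] of |S_n(u) - S(u)|,
    # evaluated on a finite grid.
    grid = np.linspace(-K, K, grid_size)
    return max(abs(S_n(u, n) - (u ** 2 + 1.0)) for u in grid)
```

For large n the supremum error over the grid is small, in line with the uniform convergence that Lemma 1 guarantees.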
Lemma 2
Let V be a symmetric and positive definite matrix, W be a random variable and \(A_{n}(u)\) be a convex objective function. If
then \(\alpha _{n}\), the \(\arg \min \) of \(A_{n}(u)\), satisfies: \(\alpha _{n} \xrightarrow {\ d\ }-V^{-1}W\).
The proof of Lemma 2 can be found in Hjort and Pollard (2011).
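To see where the limit \(-V^{-1}W\) comes from, consider the exact quadratic case (a sketch only; Lemma 2 extends this conclusion to convex objectives that equal this quadratic up to a remainder vanishing in probability):

```latex
A_{n}(u) = \tfrac{1}{2}\,u^{T}Vu + W^{T}u
\quad\Longrightarrow\quad
\nabla A_{n}(u) = Vu + W = 0
\quad\Longrightarrow\quad
\alpha_{n} = \arg\min_{u} A_{n}(u) = -V^{-1}W,
```

where the minimizer is well defined because \(V\) is symmetric and positive definite.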
Proof of Theorem 1
Note that \({\widetilde{F}}(\beta )\), defined in (2.7), is not convex because of the non-convexity of the SCAD penalty. We therefore consider a local minimizer instead of the global minimizer \({\widehat{\beta }}^{SCAD}\).
According to Lemma 1 and Fan and Li (2001), for any given \(\delta >0\), there is a large enough constant M such that
which implies that, with probability at least \(1-\delta \), there exists a local minimum in the ball \(\left\{ \beta _{0}+u/\sqrt{n}:\Vert u\Vert _{2}\le M\right\} \). This in turn implies that there exists a local minimizer such that \(\Vert {\widehat{\beta }}^{SCAD}-\beta _{0}\Vert _{2}=O_{p}(n^{-1/2})\). Performing some simple calculations, we have
where \(\epsilon _{ji}=y_{ji}-x_{ji}^{T}\beta _{0}\) and \(q<p\).
For any fixed u, we have
Since \(\rho (u)=u^{2}\), we have \(\rho '(u)=2u\) and hence \(\rho '(\epsilon _{ji})=2\epsilon _{ji}\); by \(E(\epsilon _{ji})=0\) in Condition (A1), it follows that \(E[\rho '(\epsilon _{ji})]=0\). Denote \(Q_{ji}(t)=\text {E}[\rho (\epsilon _{ji}-t)-\rho (\epsilon _{ji})]\); by the second-order Taylor expansion of \(Q_{ji}(t)\) at 0, we have
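Concretely, for the squared loss \(Q_{ji}(t)\) can be computed in closed form (a quick check we add for clarity, using only \(E(\epsilon _{ji})=0\) from Condition (A1)):

```latex
Q_{ji}(t)
= \operatorname{E}\!\left[(\epsilon_{ji}-t)^{2}-\epsilon_{ji}^{2}\right]
= t^{2} - 2t\,\operatorname{E}(\epsilon_{ji})
= t^{2},
\qquad\text{hence}\quad Q_{ji}'(0)=0, \quad Q_{ji}''(0)=2 .
```

This matches the value \(\rho ''=2\) used in the remainder bound below.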
According to Condition (A2), we have \(\sum _{i=1}^{n}x_{i}x_{i}^{T}/{n}\longrightarrow I\). Further, we have
where u is fixed.
Denote \(R_{ji,n}=\rho (\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n})-\rho (\epsilon _{ji})+2\epsilon _{ji} x_{ji}^{T}u/\sqrt{n}\); then by Taylor expansion, there exists \(g_{ji}\) between \(\epsilon _{ji}-x_{ji}^{T}u/\sqrt{n}\) and \(\epsilon _{ji}\) such that
since \(\rho ''(g_{ji})=2\). Then we have
Under Condition (A2), we have \(\sum _{i=1}^{n} u^{T}\left[ (x_{ji}x_{ji}^{T})/{n} \right] u\longrightarrow u^{T}Iu\). Then, by the fact that \(\max \limits _{1\le i\le n}(\Vert x_{ji}\Vert /{\sqrt{n}})\longrightarrow 0\) and \(\Vert u\Vert _{2}=M\), we have
Based on the above results, we have
Combining \(\text {E}\left[ \sum _{i=1}^{n}(R_{ji,n}-\text {E}R_{ji,n}) \right] ^{2}\le \sum _{i=1}^{n}\text {E}R_{ji,n}^{2}\) with (A.11), we can obtain that \(\sum _{i=1}^{n}(R_{ji,n}-\text {E}R_{ji,n})\longrightarrow 0\). Thus, we can get
For \(S_{2}\), we have
where \({\overline{\epsilon }}_{ji}=y_{ji}-x_{ji}^{T}{\overline{\beta }}\). Further, we have
Now we can simplify \(S_{n}(u)\) to
where \(\xi _{i}=2{\overline{\epsilon }}_{ji}x_{ji}, \eta _{i}=2\sum _{j=1}^{m}{\overline{\epsilon }}_{ji}x_{ji}, \zeta _{i}=2\epsilon _{ji}x_{ji}, D_{i}=\xi _{i}-{\eta _{i}}/{m}-\zeta _{i},i=1,2,\ldots ,n, W_{n}=({1}/{\sqrt{n}})\sum _{i=1}^{n}D_{i}\).
We now deal with \(W_{n}\). According to Condition (A1), we can get
where \(a=4\text {Var}({\overline{\epsilon }}_{ji}), b=4\text {Cov}({\overline{\epsilon }}_{ji},\epsilon _{ji})\). Since the \(D_{i}\) are independent and identically distributed zero-mean random vectors, we have
Under Condition (A2), we have
By routine calculations, we can get
as \(n\longrightarrow \infty \), where \(c=(m-1)a+(2-2m)b+4m\sigma ^{2}\). By the central limit theorem, we can obtain that
It follows that \(W_{n}^{T}u\) is bounded in probability, i.e.,
By applying Lemma 1 to \(Z_{n}(u)=S_{n}(u)-W_{n}^{T}u\), we can strengthen this pointwise convergence to uniform convergence on compact subsets of \({\mathbb {R}}^{p}\). We now analyze \(L_{n}(u)\). For the SCAD penalty, \(p'_{\lambda }(u)=\lambda I(|u|\le \lambda )+(\max (0,a\lambda -|u|))/(a-1)I(|u|> \lambda )\). If \(u\ge a\lambda \) (with \(a=3.7\)), then \(p'_{\lambda }(u)=0\); if \(|u|<\lambda \) and \(\lambda \longrightarrow 0\), then \(p'_{\lambda }(u)=\lambda \longrightarrow 0\).
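The two regimes of \(p'_{\lambda }\) invoked here are easy to check numerically. Below is a minimal sketch (the function name is ours; the formula is the SCAD derivative quoted above, with the conventional choice \(a=3.7\)):

```python
import numpy as np

def scad_deriv(u, lam, a=3.7):
    # SCAD derivative p'_lambda(u): equal to lam for |u| <= lam, and
    # max(0, a*lam - |u|) / (a - 1) for |u| > lam (zero once |u| >= a*lam).
    u = np.abs(np.asarray(u, dtype=float))
    return np.where(u <= lam, lam, np.maximum(0.0, a * lam - u) / (a - 1.0))
```

In particular, the derivative equals \(\lambda \) on \([-\lambda ,\lambda ]\) and vanishes for \(|u|\ge a\lambda \), which is exactly the dichotomy that drives the behavior of \(L_{n}(u)\).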
According to the results above, we can obtain that:
By Condition (A2) and Eqs. (A.2), (A.15), (A.20) and (A.21), \(n\left[ {\widetilde{F}}_{SCAD}(\beta _{0}+u/\sqrt{n})-{\widetilde{F}}_{SCAD}(\beta _{0}) \right] \) is dominated by the term \(u^{T}\left[ \sum _{i=1}^{n}x_{ji}x_{ji}^{T}/n \right] u\) when \(\Vert u\Vert _{2}=M\) is large enough, which yields Eq. (A.1). This in turn implies that there exists a local minimizer such that \(\Vert {\widehat{\beta }}^{SCAD}-\beta _{0}\Vert _{2}=O_{p}(n^{-1/2})\) as \(n\longrightarrow \infty \). Since \(N=mn\), if \(\lambda =\lambda (N)\longrightarrow 0\), then \({\widehat{\beta }}^{SCAD}\) converges to \(\beta _{0}\) in probability with \(\Vert {\widehat{\beta }}^{SCAD}-\beta _{0}\Vert _{2}=O_{p}(N^{-1/2})\) as \(N\longrightarrow \infty \). This completes the proof. \(\square \)
Lemma 3
Suppose that the sample set \(\left\{ x_{i},y_{i}\right\} _{i=1}^{N}\) is generated according to process (3.1). Under Conditions (A1) and (A2), if \(\lambda =\lambda (n)\longrightarrow 0\) and \(\sqrt{n}\lambda \longrightarrow \infty \) as \(n\longrightarrow \infty \), then with probability tending to one, for any given \(\beta _{1}\) satisfying \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\) and any constant M, we obtain
i.e., for any \(\delta >0\), we have
Proof of Lemma 3
For any \(\beta _{1}\) with \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\) and \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\), based on the proof of Theorem 1, we can obtain that
According to \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\), \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\) and (A.20), we can get that
and
where \(I_{22}\) is the bottom-right \((p-q)\)-by-\((p-q)\) submatrix of I.
Under the conditions \(\lambda =\lambda (n)\longrightarrow 0\) and \(\sqrt{n}\lambda \longrightarrow \infty \) as \(n\longrightarrow \infty \) and the fact that \(\varliminf \limits _{\lambda \longrightarrow 0}\varliminf \limits _{\theta \longrightarrow 0^{+}} p'_{\lambda }(\theta )/\lambda =1\), we have
Since \(\sqrt{n}\lambda \longrightarrow \infty \), we can get \(n\lambda \longrightarrow \infty \), which implies that, \(-n\sum _{l=q+1}^{p}p'_{\lambda }(|{\overline{\beta }}_{l}|)|\beta _{l}|\) of (A.22) dominates in magnitude. That is, \({\widetilde{F}}_{SCAD}((\beta _{1}^{T},0^{T})^{T}) -{\widetilde{F}}_{SCAD}((\beta _{1}^{T},\beta _{2}^{T})^{T})<0\) for large n. Then
with probability tending to one on any compact subset of \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\) and \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\). This completes the proof. \(\square \)
Proof of Theorem 2
(I) As discussed in Fan and Li (2001), and by Lemma 3 with \(N=mn\), it is obvious that when \(\lambda =\lambda (N)\longrightarrow 0\) and \(\sqrt{N}\lambda \longrightarrow \infty \) as \(N\longrightarrow \infty \), we obtain \({\widehat{\beta }}_{2}^{SCAD}=0\). (II) According to Theorem 1, there exists a \(\sqrt{N}\)-consistent minimizer \({\widehat{\beta }}_{1}^{SCAD}\) of \({\widetilde{F}}_{SCAD}\left( (\beta _{1}^{T},0^{T})^{T} \right) \), viewed as a function of \(\beta _{1}\).
The proof of Theorem 1 implies that \(\sqrt{n}({\widehat{\beta }}_{1}^{SCAD}-\beta _{10})\) minimizes
with respect to \(\theta \), where \(\theta =(\theta _{1},\cdots ,\theta _{q})^{T}\in {\mathbb {R}}^{q}\). By Lemma 1 and (A.15), we can obtain that
uniformly in any compact subset of \({\mathbb {R}}^{q}\), where \(x_{ji}^{1}\) consists of the first q components of \(x_{ji}\), \(\xi _{i}^{1}=2{\overline{\epsilon }}_{ji}x_{ji}^{1}, \eta _{i}^{1}=2\sum _{j=1}^{m}{\overline{\epsilon }}_{ji}x_{ji}^{1}, \zeta _{i}^{1}=2\epsilon _{ji}x_{ji}^{1}, D_{i}^{1}=\xi _{i}^{1}-\eta _{i}^{1}/m-\zeta _{i}^{1},i=1,2,\ldots ,n\), and \(W_{n}^{1}=\sum _{i=1}^{n}D_{i}^{1}/\sqrt{n}\). Based on the above results and (A.19), we can get
where \(I_{11}\) is the top-left q-by-q submatrix of I.
We define
Note that
For the SCAD penalty, \(p'_{\lambda }(\theta )=\lambda I(|\theta |\le \lambda )+(\max (0,a\lambda -|\theta |))/(a-1)I(|\theta |> \lambda )\). If \(\theta \ge a\lambda \) (with \(a=3.7\)), then \(p'_{\lambda }(\theta )=0\); if \(|\theta |<\lambda \) and \(\lambda \longrightarrow 0\), then \(p'_{\lambda }(\theta )=\lambda \longrightarrow 0\). For \(\beta _{0k}\ne 0\), since \(|{\overline{\beta }}_{0k}|\xrightarrow {\ \text {P}\ }|\beta _{0k}|>0\), the condition \(\lambda \longrightarrow 0\) ensures \(T_{1k}\longrightarrow \sqrt{n}p'_{\lambda }(|{\overline{\beta }}_{0k}|) \text {Sign}(\beta _{0k})\theta _{k}\xrightarrow {\ \text {P}\ }0\). For \(\beta _{0k}=0\), we have \(T_{1k}=0\) if \(\theta _{k}=0\); when \(\theta _{k}\ne 0\), since \(|{\overline{\beta }}_{0k}|=O_{p}(n^{-1/2})\) and \(p'_{\lambda }(\theta )=\lambda \) for \(|\theta |<\lambda \), if \(\sqrt{n}\lambda \longrightarrow \infty \) then \(T_{1k}=\sqrt{n}p'_{\lambda }(|{\overline{\beta }}_{0k}|)|\theta _{k}|=|\theta _{k}|\sqrt{n}\lambda \) with probability tending to one, and thus \(T_{1k}\xrightarrow {\ \text {P}\ }\infty \). Let us write \(\theta ^{*}=(\theta _{10}^{T},\theta _{20}^{T})^{T}\); then we have
In Theorem 2, \(\theta ^{*}=(\theta ^{T},0^{T})^{T}\), so \(T_{1}\xrightarrow {\ \text {P}\ }0\). We can further obtain that
Based on (A.24) and (A.26), together with \(\lim \limits _{n\longrightarrow \infty }\sum _{i=1}^{n}x_{ji}^{1}(x_{ji}^{1})^{T}/n=I_{11}\), we have
According to (A.25) and Lemma 2, we can obtain that
Due to \(N=mn\), we can get that
This completes the proof. \(\square \)
Proof of Theorem 3
(I) For any \(\Vert \beta _{1}-\beta _{10}\Vert _{2}=O_{p}(n^{-1/2})\) and \(\Vert \beta _{2}\Vert _{2}\le Mn^{-1/2}\), according to (A.22), we can get that
Note that the first two terms of (A.30) are exactly the same as in (A.22), and hence can be bounded similarly. However, the third term, since \(n^{(1+r)/2}\lambda \longrightarrow \infty \) and \(\sqrt{n}{\widehat{\beta }}_{k}=O_{p}(1)\), satisfies
These facts in turn imply that, as \(n\longrightarrow \infty \),
Then we have
Since \(N=mn\), if \(\lambda =\lambda (N)\) satisfies \(\sqrt{N}\lambda \longrightarrow 0\) and \(N^{(1+r)/2}\lambda \longrightarrow \infty \) as \(N\longrightarrow \infty \), we obtain \({\widehat{\beta }}_{2}^{AL}=0\).
(II) According to (A.2),
We now analyze the third term. For \(k=1,2,\ldots ,q\), since \(\beta _{0k}\ne 0\), we have \({\overline{w}}_{k}\xrightarrow {\ \text {P}\ }|\beta _{0k}|^{-r}\); by routine calculation, we get
and by \(\sqrt{n}\lambda \longrightarrow 0\), we can obtain that
For \(k=q+1,q+2,\cdots ,p\) and the true coefficient \(\beta _{0k}=0\), we have
When \(u_{k}\ne 0\), since \(\sqrt{n}\lambda {\overline{w}}_{k}=n^{(1+r)/2}\lambda (\sqrt{n}|{\widehat{\beta }}_{k}|)^{-r}\), \(n^{(1+r)/2}\lambda \longrightarrow \infty \) and \(\sqrt{n}|{\widehat{\beta }}_{k}|=O_{p}(1)\), we can get that
When \(u_{k}=0\), \(n\lambda \left[ {\overline{w}}_{k}|\beta _{0k}+u_{k}/\sqrt{n}| -{\overline{w}}_{k}|\beta _{0k}| \right] =0\).
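The dichotomy between zero and nonzero coefficients can be illustrated numerically. In the sketch below (our own illustration, not from the paper: the function name, the choices \(r=1\), \(\lambda =n^{-3/4}\), and \({\widehat{\beta }}_{k}=c/\sqrt{n}\) are all assumptions, picked so that \(\sqrt{n}\lambda \longrightarrow 0\) while \(n^{(1+r)/2}\lambda \longrightarrow \infty \)), the effective penalty \(\sqrt{n}\lambda {\overline{w}}_{k}\) on a null coefficient grows like \(n^{1/4}\):

```python
import numpy as np

def penalty_scale(n, r=1.0, c=1.0):
    # Effective adaptive-LASSO penalty sqrt(n)*lam*w_k on a null coefficient,
    # with weight w_k = |beta_hat_k|^{-r} and beta_hat_k = c/sqrt(n) = O_p(n^{-1/2}).
    # lam = n^{-3/4} gives sqrt(n)*lam -> 0 and n^{(1+r)/2}*lam -> inf for r = 1.
    lam = n ** (-0.75)
    beta_hat = c / np.sqrt(n)
    w = np.abs(beta_hat) ** (-r)
    return np.sqrt(n) * lam * w  # = n^{(1+r)/2} * lam * (sqrt(n)|beta_hat_k|)^{-r}
```

For \(c=1\) this equals \(n^{1/4}\), roughly 3.16 at \(n=100\) and 100 at \(n=10^{8}\), mirroring the divergence of the penalty term for null coefficients.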
We define \(u^{*}=(u_{10}^{T},u_{20}^{T})^{T}\) and \(W_{n}=((W_{n}^{1})^{T},(W_{n}^{2})^{T})^{T}\). Combining (A.31), (A.32), (A.33), Theorem 1 and Condition (A2), we conclude that for each fixed u,
where \(u^{1}=(u_{1},u_{2},\cdots ,u_{q})^{T}\). Note that \(n\left[ {\widetilde{F}}_{AL}(\beta _{0}+u/\sqrt{n})-{\widetilde{F}}_{AL}(\beta _{0}) \right] \) is convex in u and V(u) has a unique minimizer; then we have
According to (A.25), (A.34), (A.35) and Lemma 2, we can obtain that
Due to \(N=mn\), if \(\sqrt{N}\lambda \longrightarrow 0\) and \(N^{(1+r)/2}\lambda \longrightarrow \infty \) as \(N\longrightarrow \infty \), we can get that
This completes the proof. \(\square \)
Liu, Z., Zhao, X. & Pan, Y. Communication-efficient distributed estimation for high-dimensional large-scale linear regression. Metrika 86, 455–485 (2023). https://doi.org/10.1007/s00184-022-00878-x