Abstract
Large-scale data from various research fields are not only heterogeneous and sparse but also difficult to store on a single machine. Expectile regression is a popular alternative for modeling heterogeneous data. In this paper, we devise a distributed optimization approach to SCAD and adaptive LASSO penalized expectile regression, where the observations are randomly partitioned across multiple machines. We construct a penalized communication-efficient surrogate loss (CSL) function. Computationally, our method based on the CSL function requires only the master machine to solve a regular M-estimation problem, while other worker machines compute the gradient of the loss function on local data. Our method matches the estimation error bound of the centralized method during consecutive rounds of communication. Under some mild assumptions, we establish the oracle properties of the SCAD and adaptive LASSO penalized expectile regression. We then develop a modified alternating direction method of multipliers (ADMM) algorithm for the implementation of the proposed estimator. A series of simulation studies are conducted to assess the finite-sample performance of the proposed estimator. Applications to an HIV study demonstrate the practicability of the proposed method.
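For concreteness, the asymmetric squared (expectile) loss underlying the method, and its derivative used throughout the Appendix, can be sketched as follows (a minimal illustration; the function names are ours, not from the paper's implementation):

```python
def expectile_loss(u, tau):
    # rho_tau(u) = tau * u^2 if u > 0, else (1 - tau) * u^2;
    # for tau = 0.5 this reduces to ordinary least squares (up to a constant factor).
    w = tau if u > 0 else 1.0 - tau
    return w * u * u

def expectile_score(u, tau):
    # phi_tau(u) = 2*tau*u*I(u > 0) + 2*(1 - tau)*u*I(u <= 0),
    # the derivative of rho_tau, as used in the proofs below.
    w = tau if u > 0 else 1.0 - tau
    return 2.0 * w * u
```

Larger \(\tau\) penalizes positive residuals more heavily, which is how expectile regression captures heterogeneity across the response distribution.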
References
Boyd, S. P., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Cheng, G., & Shang, Z. (2015). Computational limits of divide-and-conquer method. arXiv preprint arXiv:1512.09226.
Dekel, O., Gilad-Bachrach, R., Shamir, O., & Xiao, L. (2012). Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 165–202.
Fan, J., Fan, Y., & Barut, E. (2014a). Adaptive robust variable selection. Annals of Statistics, 42(1), 324–351.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Fan, J., Xue, L., & Zou, H. (2014b). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics, 42(3), 819–849.
Gu, Y., Fan, J., Kong, L., Ma, S., & Zou, H. (2018). ADMM for high-dimensional sparse penalized quantile regression. Technometrics, 60(3), 319–331.
Jaggi, M., Smith, V., Takac, M., Terhorst, J., Krishnan, S., Hofmann, T., & Jordan, M. I. (2014). Communication-efficient distributed dual coordinate ascent. In Advances in neural information processing systems. arXiv:1409.1458v2.
Jones, M. C. (1994). Expectiles and M-quantiles are quantiles. Statistics and Probability Letters, 20(2), 149–153.
Jordan, M. I., Lee, J. D., & Yang, Y. (2019). Communication-efficient distributed statistical learning. Journal of the American Statistical Association, 114(526), 668–681.
Liao, L., Park, C., & Choi, H. (2019). Penalized expectile regression: an alternative to penalized quantile regression. Annals of the Institute of Statistical Mathematics, 71, 409–438.
Newey, W. K., & Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica, 55(4), 819–847.
Pan, Y., Liu, Z., & Cai, W. (2020). Large-scale expectile regression with covariates missing at random. IEEE Access. https://doi.org/10.1109/ACCESS.2020.2970741.
Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7(2), 186–199.
Rosenblatt, J. D., & Nadler, B. (2016). On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4), 379–404.
Schnabel, S. K., & Eilers, P. H. (2009). Optimal expectile smoothing. Computational Statistics and Data Analysis, 53(12), 4168–4177.
Shamir, O., Srebro, N., & Zhang, T. (2014). Communication-efficient distributed optimization using an approximate Newton-type method. International Conference on Machine Learning, 1000–1008.
Sobotka, F., & Kneib, T. (2012). Geoadditive expectile regression. Computational Statistics and Data Analysis, 56(4), 755–767.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Waltrup, L. S., Sobotka, F., Kneib, T., & Kauermann, G. (2015). Expectile and quantile regression-David and Goliath? Statistical Modelling, 15(5), 433–456.
Wang, J., Kolar, M., & Srebro, N. (2016). Distributed multi-task learning. In Artificial intelligence and statistics. arXiv:1510.00633v1.
Wang, J., Kolar, M., Srebro, N., & Zhang, T. (2017). Efficient distributed learning with sparsity. Proceedings of the 34th International Conference on Machine Learning. JMLR.org, 70, 3636–3645.
Wang, L., Wu, Y., & Li, R. (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107(497), 214–222.
Wu, Y., & Liu, Y. (2009). Variable selection in quantile regression. Statistica Sinica, 19(2), 801–817.
Zhang, Y., Duchi, J. C., & Wainwright, M. J. (2013). Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14(68), 3321–3363.
Zhang, Y., & Xiao, L. (2015). Communication-efficient distributed optimization of self-concordant empirical loss. arXiv preprint arXiv:1501.00263.
Zhao, T., Cheng, G., & Liu, H. (2016). A partially linear framework for massive heterogeneous data. Annals of Statistics, 44(4), 1400–1437.
Zhao, J., & Zhang, Y. (2018). Variable selection in expectile regression. Communications in Statistics-Theory and Methods, 47(7), 1731–1746.
Ziegel, J. F. (2016). Coherence and elicitability. Mathematical Finance, 26(4), 901–918.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4), 1509–1533.
Acknowledgements
This research is supported in part by the National Science Foundation of China (11901175 to Y. P.) and the Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University (HBAM201907 to Y. P.).
Appendix: Proof of Theorem
Proof of Theorem 1
The proposed objective function \({\widetilde{L}}_{\text {SCAD}}(\beta )\) is not convex, so we consider a local minimizer rather than the global one. To avoid confusion, we still denote the local solution by \({\widehat{\beta }}^{(\text {SCAD})}\). Following the ideas of Pollard (1991) and Fan and Li (2001), we show that for any given \(\delta >0\) there exists a sufficiently large constant c such that
Note that (18) shows that a local minimum exists in the ball \(\left\{ \beta _{0}+\frac{u}{\sqrt{n}}: \left\| u\right\| _{2}\le c\right\}\), which implies that there is a local minimizer satisfying \(\left\| {\widehat{\beta }}^{(\text {SCAD})}-\beta _{0}\right\| _{2}=O_{p}(n^{-\frac{1}{2}})\). Hence the proof of Theorem 1 is complete once we show that \(n[{\widetilde{L}}_{\text {SCAD}}(\beta _{0}+\frac{u}{\sqrt{n}})\) \(-{\widetilde{L}}_{\text {SCAD}}(\beta _{0})]\) is dominated by its positive quadratic term when \(\left\| u\right\| _{2}\) equals a sufficiently large constant c.
For simplicity, the dataset stored on the first machine is denoted by \(\{x_{1i}, y_{1i}\}\overset{\wedge }{=}\{x_{i}, y_{i}\}_{i=1}^{n}\). Let \({\overline{\epsilon }}_{i}=y_{i}-x_{i}^{\text {T}}{\overline{\beta }}\), \({\overline{\epsilon }}_{ji}=y_{ji}-x_{ji}^{\text {T}}{\overline{\beta }}\), \(\varphi _{\tau }(u)=2\tau uI(u >0)+2(1-\tau )uI(u \le 0)\), \(\nabla F({\overline{\beta }})\overset{\wedge }{=}\nabla L_{N}({\overline{\beta }})-\nabla L_{1}({\overline{\beta }})\), and \(\epsilon _{i}=y_{i}-x_{i}^{\text {T}}\beta _{0}\). A direct calculation gives \(\nabla F({\overline{\beta }})=\frac{1}{n}\mathop {\sum }\nolimits _{i=1}^{n} x_{i}\varphi _{\tau }({\overline{\epsilon }}_{i})-\frac{1}{mn}\mathop {\sum }\nolimits _{j=1}^{m} \mathop {\sum }\nolimits _{i=1}^{n} x_{ji}\varphi _{\tau }({\overline{\epsilon }}_{ji})\). Thus
where \(\text {I}=\mathop {\sum }\limits _{i=1}^{n} \left[ \rho _{\tau }\left( \epsilon _{i}-\frac{x_{i}^{\text {T}}u}{\sqrt{n}} \right) - \rho _{\tau }(\epsilon _{i}) \right]\), \(\text {II}=u^{\text {T}}\left[ \frac{1}{\sqrt{n}}\mathop {\sum }\limits _{i=1}^{n} \left( x_{i}\varphi _{\tau }({\overline{\epsilon }}_{i})-\frac{1}{m} \sum _{j=1}^{m}x_{ji}\varphi _{\tau }({\overline{\epsilon }}_{ji}) \right) \right]\), \(\text {III}=n\mathop {\sum }\limits _{k=1}^{p} \left[ p_{\lambda }\left( \left| \beta _{0k}+\frac{u_{k}}{\sqrt{n}}\right| \right) -p_{\lambda } (\left| \beta _{0k}\right| ) \right]\).
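The gradient correction \(\nabla F({\overline{\beta }})\) appearing in term \(\text {II}\) is the only quantity the worker machines must communicate. A toy sketch of this computation (illustrative only; the function names and data layout are our assumptions, not the paper's implementation):

```python
def phi_tau(u, tau):
    # phi_tau(u) = 2*tau*u*I(u > 0) + 2*(1 - tau)*u*I(u <= 0)
    return 2.0 * (tau if u > 0 else 1.0 - tau) * u

def local_gradient(X, y, beta, tau):
    # Gradient of the local loss L_j(beta) = (1/n) sum_i rho_tau(y_i - x_i^T beta);
    # note d/dbeta rho_tau(y - x^T beta) = -x * phi_tau(y - x^T beta).
    n, p = len(y), len(beta)
    g = [0.0] * p
    for xi, yi in zip(X, y):
        r = yi - sum(xik * bk for xik, bk in zip(xi, beta))
        s = phi_tau(r, tau)
        for k in range(p):
            g[k] -= xi[k] * s / n
    return g

def gradient_correction(all_grads, master_grad):
    # nabla F(beta_bar) = nabla L_N(beta_bar) - nabla L_1(beta_bar), where
    # nabla L_N is the average of all m machines' local gradients
    # (including the master's own gradient master_grad).
    m, p = len(all_grads), len(master_grad)
    avg = [sum(g[k] for g in all_grads) / m for k in range(p)]
    return [avg[k] - master_grad[k] for k in range(p)]
```

Each worker only ships a length-p gradient vector per round, which is the communication-efficiency point of the CSL construction.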
Under Assumptions (A1) and (A2), similarly to the arguments of Zhao and Zhang (2018), we obtain
where \(g(\tau )=\tau (1-F_{\epsilon }(0))+(1-\tau )F_{\epsilon }(0)\). Therefore, we obtain
where \(W_{n}=\frac{1}{\sqrt{n}}\mathop {\sum }\limits _{i=1}^{n} D_{i}\), and
with \(\xi _{i}=x_{i}\varphi _{\tau }({\overline{\epsilon }}_{i})\), \(\eta _{i}=\mathop {\sum }\limits _{j=1}^{m} x_{ji}\varphi _{\tau }({\overline{\epsilon }}_{ji})\), \(\zeta _i=x_i\varphi _{\tau }(\epsilon _i)\) and \(I_{p\times p}\) is the p-order identity matrix.
Under Assumption (A2), which states that the \(\tau\)-expectile of the error term is zero, we derive \(\text {E}\left[ \varphi _{\tau }({\overline{\epsilon }}_{i}) \right] =\text {E}\left[ \varphi _{\tau } ({\overline{\epsilon }}_{ji}) \right] =\text {E}\left[ \varphi _{\tau }(\epsilon _{i}) \right] =0\), and
where \(c(\tau )=\tau ^{2}\text {E}\left[ \epsilon _{i}^{2}I(\epsilon _{i}>0) \right] +(1-\tau )^{2}\text {E}\left[ \epsilon _{i}^{2}I(\epsilon _{i}\le 0) \right]\), \(A=\text {Var}\left[ \varphi _{\tau }({\overline{\epsilon }}_{i}) \right]\), \(B=\text {Cov}(\varphi _{\tau }({\overline{\epsilon }}_{i}),\varphi _{\tau }(\epsilon _{i}))\). Therefore,
Since the \(D_{i}\) are independent and identically distributed zero-mean random vectors, under Assumption (A2) we have
where \(h(m,\tau )=\left[ (m-1)A+(2-2m)B+4mc(\tau ) \right]\). By the central limit theorem, we have
Therefore, \(W_{n}^{\text {T}}u\) is bounded in probability, i.e.
Owing to the fact that
For the SCAD penalty \(p_{\lambda }(\theta )\), we have \(p_{\lambda }^{'}(\theta )\equiv 0\) for \(\theta \in [a\lambda ,+\infty )\), that is, \(p_{\lambda }(\theta )\) is constant if \(\theta \ge a\lambda\). Thus, if \(\lambda =\lambda (n)\longrightarrow 0\), we obtain
uniformly in any compact subset of \(R^{p}\).
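The flat-tail property of the SCAD penalty invoked here can be made concrete with a short sketch, using the standard form of Fan and Li (2001) (the value \(a=3.7\) is the conventional default; the function names are ours):

```python
def scad_penalty(theta, lam, a=3.7):
    # SCAD penalty of Fan and Li (2001): linear near zero, a quadratic blend,
    # then constant at (a + 1) * lam^2 / 2 for |theta| >= a * lam.
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2.0 * a * lam * t - t * t - lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0

def scad_deriv(theta, lam, a=3.7):
    # p'_lambda(theta) = lam * I(theta <= lam) + (a*lam - theta)_+ / (a - 1) * I(theta > lam);
    # identically zero on [a*lam, infinity), the property used in the proof.
    t = abs(theta)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)
```

Because the penalty is constant beyond \(a\lambda\), large coefficients incur no shrinkage bias, which is what drives the oracle property.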
By Assumption (A2) and (21), (25), and (26), \(n\left[ {\widetilde{L}}_{\text {SCAD}}(\beta _{0}+\frac{u}{\sqrt{n}})- {\widetilde{L}}_{\text {SCAD}}(\beta _{0}) \right]\) is dominated by the term \(g(\tau )u^{\text {T}}\varSigma u\) when \(\left\| u\right\| _{2}=c\) is sufficiently large. Thus, we have \({\widehat{\beta }}^{(\text {SCAD})}\xrightarrow {\ \text {P}\ }\beta _{0}\) as \(n\longrightarrow \infty\). Since \(N=nm\), if \(\lambda =\lambda (N)\longrightarrow 0\), we have \({\widehat{\beta }}^{(\text {SCAD})}\xrightarrow {\ \text {P}\ }\beta _{0}\) as \(N\longrightarrow \infty\). \(\square\)
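A remark that may help the reader check the constants above: because the indicators \(I(\epsilon _{i}>0)\) and \(I(\epsilon _{i}\le 0)\) are disjoint and \(\text {E}[\varphi _{\tau }(\epsilon _{i})]=0\) under Assumption (A2), the variance of the score reduces to the constant \(c(\tau )\):

```latex
\operatorname{Var}\left[\varphi_{\tau}(\epsilon_{i})\right]
= \mathrm{E}\left[\varphi_{\tau}(\epsilon_{i})^{2}\right]
= 4\tau^{2}\,\mathrm{E}\left[\epsilon_{i}^{2} I(\epsilon_{i}>0)\right]
+ 4(1-\tau)^{2}\,\mathrm{E}\left[\epsilon_{i}^{2} I(\epsilon_{i}\le 0)\right]
= 4 c(\tau).
```

The constants A and B for the residuals \({\overline{\epsilon }}_{i}\) are handled analogously, with \({\overline{\beta }}\) in place of \(\beta _{0}\).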
Proof of Theorem 2
To prove the sparsity result, it suffices to show that if \(\lambda =\lambda \left( n \right) \rightarrow 0\) and \(\sqrt{n}\lambda \rightarrow \infty\) as \(n\rightarrow \infty\), then, with probability tending to one, for any given \(\beta _{1}\) satisfying \(\left\| \beta _{1}-\beta _{10}\right\| _{2}=O_{p}\left( n^{-\frac{1}{2}} \right)\) and any constant c,
i.e., for any \(\delta >0\)
By performing a simple calculation, we obtain
Based on the given conditions \(\left\| \beta _{1}-\beta _{10}\right\| _{2}=O_{p}\left( n^{-\frac{1}{2}} \right)\) and \(\left\| \beta _{2}\right\| _{2}\le {cn^{-\frac{1}{2}}}\), together with Assumption (A2), we obtain
By (25) in the proof of Theorem 1, we have
where the last equality in (30) relies on the fact that
with \(\varSigma _{22}=\left( a_{ij} \right) _{i,j=1}^{p-s}\), \(\beta _{2}=\left( \beta _{2i} \right) _{i=1}^{p-s}\).
For the SCAD penalty, under the conditions \(\lambda =\lambda \left( n \right) \rightarrow 0\) and \(\sqrt{n}\lambda \rightarrow \infty\) as \(n\rightarrow \infty\), we have
This follows from the fact that \(\mathop {\lim }\limits _{\lambda \rightarrow 0}\mathop {\lim }\limits _{\theta \rightarrow 0^{+}}\frac{p_{\lambda }^{'}\left( \theta \right) }{\lambda }=1\). Since \(\sqrt{n}\lambda \rightarrow \infty\) and \(\left\| \beta _{2}\right\| _{2}\le {cn^{-\frac{1}{2}}}\), the penalty term \(n\sum _{k=s+1}^{p}p_{\lambda }\left( \left| \beta _{k}\right| \right)\) is of higher order than \(\sqrt{n}\). Then \(\mathop {\inf }\limits _{\left\| \beta _{2}\right\| \le {cn^{-\frac{1}{2}}}}{\widetilde{L}}_{\text {SCAD}}\left( \beta _{1},\beta _{2} \right) >{\widetilde{L}}_{\text {SCAD}}\left( \beta _{1},0 \right)\) with probability tending to one. Therefore, we derive result (27) on any compact subset of \(\left\{ \beta :\left\| \beta _{1}-\beta _{10}\right\| _{2}=O_{p}\left( n^{-\frac{1}{2}} \right) , \left\| \beta _{2}\right\| _{2}\le {cn^{-\frac{1}{2}}}\right\}\). Since \(N=nm\), it follows that if \(\lambda =\lambda \left( N \right) \rightarrow 0\) and \(\sqrt{N}\lambda \rightarrow \infty\) as \(N\rightarrow \infty\), then \({\widehat{\beta }}_{2}^{(\text {SCAD})}=0\).
In the following, we prove the asymptotic normality of \({\widehat{\beta }}_{1}^{(\text {SCAD})}\). From the definition of \({\widehat{\beta }}^{(\text {SCAD})}\) and the notation (21), we can see that \(\sqrt{n}\left( {\widehat{\beta }}_{1}^{(\text {SCAD})}-\beta _{10} \right)\) minimizes \(G_{n}\left( \left( \theta ^{\text {T}},0^{\text {T}} \right) ^{\text {T}} \right) +n\sum _{k=1}^{s}p_{\lambda }\left( \left| \beta _{0k}+\frac{\theta _{k}}{\sqrt{n}}\right| \right)\) with respect to \(\theta\). The proof of Theorem 1 implies that
where \(W_{n,11}\xrightarrow {d}N\left( 0,m^{-1}h\left( m,\tau \right) \varSigma _{11} \right)\). For large n, under condition \(\lambda =\lambda \left( n \right) \rightarrow 0\), we get
uniformly in any compact subset of \(R^{s}\), and this term does not depend on the parameter \(\theta\). Denote
From the above results, we have
Due to \(N=nm\) and the result (35), \({\widehat{\beta }}_{1}^{(\text {SCAD})}\) has the following asymptotic property:
Proof of Theorem 3
Similarly to (19) in the proof of Theorem 1, we obtain
Now consider the third term in (36), for \(k=1,\ldots ,s\), the true coefficient \(\beta _{0k}\ne 0\), then \({\widetilde{\omega }}_{k}\xrightarrow {\ \text {P}\ }\left| \beta _{0k}\right| ^{-r}\) and \(\sqrt{n}\left( \left| \beta _{0k}+\frac{u_{k}}{\sqrt{n}}\right| -\left| \beta _{0k}\right| \right) \rightarrow {u_{k}}\text {sgn}\left( \beta _{0k} \right)\). Thus, by Slutsky’s theorem and \(\sqrt{n}\lambda \rightarrow 0\), we have
On the other hand, for \(k=s+1,s+2,\ldots ,p\), we have \(\beta _{0k}=0\), so \(\sqrt{n}\left( \left| \beta _{0k}+\frac{u_{k}}{\sqrt{n}}\right| -\left| \beta _{0k}\right| \right) =\left| u_{k}\right|\) and, writing \(|{\widehat{\beta }}_{k}|=n^{-\frac{1}{2}}|\sqrt{n}{\widehat{\beta }}_{k}|\), \(\sqrt{n}\lambda {\widetilde{\omega }}_{k}=\sqrt{n}\lambda |{\widehat{\beta }}_{k}|^{-r}=n^{\frac{1+r}{2}}\lambda \left( \left| \sqrt{n}{\widehat{\beta }}_{k}\right| \right) ^{-r}\), where \(\sqrt{n}{\widehat{\beta }}_{k}=O_{p}\left( 1 \right)\), so it follows that
Thus, from (36), (37), (38) and Slutsky’s theorem and Assumption (A2), we obtain
where \(u^{1}=(u_{1},u_{2},\ldots ,u_{s})^{\text {T}}\). Noticing that \(n\left[ {\widetilde{L}}_{\text {AL}}\left( \beta _{0}+\frac{u}{\sqrt{n}} \right) -{\widetilde{L}}_{\text {AL}}\left( \beta _{0} \right) \right]\) is convex in u and that \(\text {E}\left( u \right)\) has a unique minimizer, we have
Using (40) and as in the derivation of (35), we obtain
i.e. when \(\sqrt{N}\lambda \rightarrow 0\) and \(N^{\frac{r+1}{2}}\lambda \rightarrow \infty\), we have
Next we show the consistency of the model selection. For any \(\beta _{1}\) satisfying \(\left\| \beta _{1}-\beta _{10}\right\| _{2}=O_{p}\left( n^{-\frac{1}{2}} \right)\) and \(0<\left\| \beta _{2}\right\| _{2}<cn^{-\frac{1}{2}}\), similarly to (28), we have
By applying the condition \(n^{\frac{1+r}{2}}\lambda \rightarrow \infty\) as \(n\rightarrow \infty\), we obtain
Therefore, the second term of (41) tends to \(-\infty\) as \(n\rightarrow \infty\), which in turn implies that \(n\left[ {\widetilde{L}}_{\text {AL}}\left( \left( \beta _{1}^{\text {T}},0^{\text {T}} \right) ^{\text {T}} \right) -{\widetilde{L}}_{\text {AL}}\left( \left( \beta _{1}^{\text {T}},\beta _{2}^{\text {T}} \right) ^{\text {T}} \right) \right] <0\) for large n. By the same argument as in the proof of Theorem 2, we have \({\widehat{\beta }}_{2}^{(\text {AL})}=0\). That is, under the assumptions \(\sqrt{N}\lambda \rightarrow 0\) and \(N^{\frac{1+r}{2}}\lambda \rightarrow \infty\) as \(N\rightarrow \infty\), we have \({\widehat{\beta }}_{2}^{(\text {AL})}=0\). \(\square\)
Cite this article
Pan, Y. Distributed optimization and statistical learning for large-scale penalized expectile regression. J. Korean Stat. Soc. 50, 290–314 (2021). https://doi.org/10.1007/s42952-020-00074-5