Multi-consensus decentralized primal-dual fixed point algorithm for distributed learning

Tang, Kejie; Liu, Weidong; Mao, Xiaojun

doi:10.1007/s10994-024-06537-8

Published: 08 April 2024

Machine Learning (2024)


Abstract

Decentralized distributed learning has recently attracted significant attention in many applications in machine learning and signal processing. To solve a decentralized optimization problem with regularization, we propose a Multi-consensus Decentralized Primal-Dual Fixed Point (MD-PDFP) algorithm. We apply multiple consensus steps with the gradient tracking technique to extend the primal-dual fixed point method over a network. The communication complexities of our procedure are given under certain conditions. Moreover, we show that our algorithm is consistent under general conditions and enjoys global linear convergence under strong convexity. With some particular choices of regularizations, our algorithm can be applied to decentralized machine learning applications. Finally, several numerical experiments and real data analyses are conducted to demonstrate the effectiveness of the proposed algorithm.

Availability of data and materials

Only public datasets are used; they are available at https://people.eecs.berkeley.edu/~mlustig/Software.html and http://www.cs.toronto.edu/~kriz/cifar.html.

Code availability

Demo code for our simulations and real-data applications is available at https://github.com/kejie-tang/MD-PDFP.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
Alaviani, S. S. & Kelkar, A. G. (2021). Parallel alternating direction primal-dual (padpd) algorithm for centralized optimization. In 2021 60th IEEE Conference on Decision and Control (CDC), (pp. 962–967). IEEE.
Alaviani, S. S., Kelkar, A. & Vaidya, U. (2022). A fully parallel distributed algorithm for non-smooth convex optimization with coupled constraints: Application to linear algebraic equations. In 2022 American control conference (ACC), (pp. 920–925). IEEE
Alexandru, A. B., Tsiamis, A. & Pappas, G. J. (2021). Encrypted distributed lasso for sparse data predictive control. In 2021 60th IEEE conference on decision and control (CDC), (pp. 4901–4906). IEEE.
Alghunaim, S. A., Ryu, E. K., Yuan, K., & Sayed, A. H. (2020). Decentralized proximal gradient algorithms with linear convergence rates. IEEE Transactions on Automatic Control, 66(6), 2787–2794.
Ali, M. S., Vecchio, M., Pincheira, M., Dolui, K., Antonelli, F., & Rehmani, M. H. (2018). Applications of blockchains in the internet of things: A comprehensive survey. IEEE Communications Surveys and Tutorials, 21(2), 1676–1717.
Ammanouil, R., Ferrari, A., Flamary, R., Ferrari, C. & Mary, D. (2017). Multi-frequency image reconstruction for radio-interferometry with self-tuned regularization parameters. In 2017 25th European signal processing conference (EUSIPCO), (pp. 1435–1439). IEEE.
Blaimer, M., Breuer, F., Mueller, M., Heidemann, R. M., Griswold, M. A., & Jakob, P. M. (2004). Smash, sense, pils, grappa: How to choose the optimal method. Topics in Magnetic Resonance Imaging, 15(4), 223–236.
Bullo, F., Cortés, J., & Martinez, S. (2009). Distributed control of robotic networks: A mathematical approach to motion coordination algorithms (Vol. 27). Princeton University Press.
Cao, Y., Yu, W., Ren, W., & Chen, G. (2012). An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1), 427–438.
Chang, T.-H., Hong, M., & Wang, X. (2014). Multi-agent distributed optimization via inexact consensus admm. IEEE Transactions on Signal Processing, 63(2), 482–497.
Chen, C., Zhang, J., Shen, L., Zhao, P. & Luo, Z. (2021). Communication efficient primal-dual algorithm for nonconvex nonsmooth distributed optimization. In International conference on artificial intelligence and statistics, (pp. 1594–1602). PMLR.
Chen, P., Huang, J., & Zhang, X. (2013). A primal-dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems, 29(2), 025011.
Combettes, P. L., & Wajs, V. R. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4), 1168–1200.
Eckstein, J., & Bertsekas, D. P. (1992). On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1), 293–318.
Esser, E., Zhang, X., & Chan, T. F. (2010). A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging Sciences, 3(4), 1015–1046.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1.
Gao, H. (2022). Decentralized stochastic gradient descent ascent for finite-sum minimax problems. arXiv preprint arXiv:2212.02724
Giannakis, G. B., Kekatos, V., Gatsis, N., Kim, S.-J., Zhu, H., & Wollenberg, B. F. (2013). Monitoring and optimization for power grids: A signal processing perspective. IEEE Signal Processing Magazine, 30(5), 107–128.
Goldstein, T., & Osher, S. (2009). The split Bregman method for l1-regularized problems. SIAM Journal on Imaging Sciences, 2(2), 323–343.
Hallac, D., Leskovec, J. & Boyd, S. (2015). Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, (pp. 387–396).
Hendrikx, H., Bach, F. & Massoulié, L. (2019). An accelerated decentralized stochastic proximal algorithm for finite sums. Advances in Neural Information Processing Systems, 32.
Huber, P. J. (1992). Robust estimation of a location parameter. Breakthroughs in statistics: Methodology and Distribution, 492–518.
Jacob, L., Obozinski, G., & Vert, J.-P. (2009). Group lasso with overlap and graph lasso. In Proceedings of the 26th annual international conference on machine learning, (pp. 433–440).
Johnstone, P. R., & Eckstein, J. (2021). Single-forward-step projective splitting: Exploiting cocoercivity. Computational Optimization and Applications, 78(1), 125–166.
Kakade, S., Shamir, O., Sridharan, K. & Tewari, A. (2010). Learning exponential families in high-dimensions: Strong convexity and sparsity. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, (pp. 381–388). JMLR Workshop and Conference Proceedings.
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.
Lan, S., Wang, Z., Roy-Chowdhury, A. K., Wei, E. & Zhu, Q. (2020). Distributed multi-agent video fast-forwarding. In Proceedings of the 28th ACM international conference on multimedia, (pp. 1075–1084).
Li, B., Cen, S., Chen, Y. & Chi, Y. (2020). Communication-efficient distributed optimization in networks with gradient tracking and variance reduction. In International conference on artificial intelligence and statistics, (pp. 1662–1672). PMLR.
Liao, L., Shen, L., Duan, J., Kolar, M. & Tao, D. (2022). Local adagrad-type algorithm for stochastic convex-concave optimization. Machine Learning, 1–20.
Li, H., Fang, C., Yin, W., & Lin, Z. (2020). Decentralized accelerated gradient methods with increasing penalty parameters. IEEE Transactions on Signal Processing, 68, 4855–4870.
Ling, Q., & Tian, Z. (2010). Decentralized sparse signal recovery for compressive sleeping wireless sensor networks. IEEE Transactions on Signal Processing, 58(7), 3816–3827.
Lin, F.-H., Kwong, K. K., Belliveau, J. W., & Wald, L. L. (2004). Parallel imaging reconstruction using automatic regularization. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, 51(3), 559–567.
Li, Z., Shi, W., & Yan, M. (2019). A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. IEEE Transactions on Signal Processing, 67(17), 4494–4506.
Liu, J., & Morse, A. S. (2011). Accelerated linear iterations for distributed averaging. Annual Reviews in Control, 35(2), 160–165.
Lustig, M. (2017). ESPIRiT: Reference implementation of compressed sensing and parallel imaging in Matlab.
Micchelli, C. A., Shen, L., & Xu, Y. (2011). Proximity algorithms for image models: Denoising. Inverse Problems, 27(4), 045009.
Nedić, A., Olshevsky, A., & Rabbat, M. G. (2018). Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5), 953–976.
Nedic, A., Olshevsky, A., & Shi, W. (2017). Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4), 2597–2633.
Nedic, A., & Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1), 48–61.
Olshevsky, A. (2014). Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control. arXiv preprint arXiv:1411.4186
Olshevsky, A. (2017). Linear time average consensus and distributed optimization on fixed graphs. SIAM Journal on Control and Optimization, 55(6), 3990–4014.
Ouyang, H., He, N., Tran, L. & Gray, A. (2013). Stochastic alternating direction method of multipliers. In International conference on machine learning, (pp. 80–88). PMLR.
Qian, Y., Ye, M., & Zhou, J. (2012). Hyperspectral image classification based on structured sparse logistic regression and three-dimensional wavelet texture features. IEEE Transactions on Geoscience and Remote Sensing, 51(4), 2276–2291.
Qu, G., & Li, N. (2017). Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(3), 1245–1260.
Qu, G., & Li, N. (2019). Accelerated distributed Nesterov gradient descent. IEEE Transactions on Automatic Control, 65(6), 2566–2581.
Rudin, L. I., Osher, S., & Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4), 259–268.
Scaman, K., Bach, F., Bubeck, S., Lee, Y. T. & Massoulié, L. (2017). Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International conference on machine learning, (pp. 3027–3036). PMLR.
Shi, W., Ling, Q., Wu, G., & Yin, W. (2015). Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2), 944–966.
Shi, W., Ling, Q., Wu, G., & Yin, W. (2015). A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22), 6013–6023.
Simonyan, K. & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sridhar, V., Wang, X., Buzzard, G. T., & Bouman, C. A. (2020). Distributed iterative CT reconstruction using multi-agent consensus equilibrium. IEEE Transactions on Computational Imaging, 6, 1153–1166.
Sundhar Ram, S., Nedić, A., & Veeravalli, V. V. (2010). Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3), 516–545.
Sun, Y., Scutari, G., & Daneshmand, A. (2022). Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM Journal on Optimization, 32(2), 354–385.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.
Tibshirani, R. J., & Taylor, J. (2011). The solution path of the generalized lasso. The Annals of Statistics, 39(3), 1335–1371.
Tsianos, K. I., Lawlor, S. & Rabbat, M. G. (2012). Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning. In 2012 50th annual allerton conference on communication, control, and computing (allerton), (pp. 1543–1550). IEEE.
Uribe, C. A., Lee, S., Gasnikov, A. & Nedić, A. (2020). A dual approach for optimal algorithms in distributed optimization over networks. In 2020 Information theory and applications workshop (ITA), (pp. 1–37). IEEE.
Wang, X., Sridhar, V., Ronaghi, Z., Thomas, R., Deslippe, J., Parkinson, D., Buzzard, G. T., Midkiff, S. P., Bouman, C. A. & Warfield, S. K. (2019). Consensus equilibrium framework for super-resolution and extreme-scale ct reconstruction. In Proceedings of the international conference for high performance computing, networking, storage and analysis, (pp. 1–23).
Wang, H., Liao, X., & Huang, T. (2013). Accelerated consensus to accurate average in multi-agent networks via state prediction. Nonlinear Dynamics, 73, 551–563.
Xin, R., Pu, S., Nedić, A., & Khan, U. A. (2020). A general framework for decentralized optimization with first-order methods. Proceedings of the IEEE, 108(11), 1869–1889.
Xu, J., Tian, Y., Sun, Y., & Scutari, G. (2021). Distributed algorithms for composite optimization: Unified framework and convergence analysis. IEEE Transactions on Signal Processing, 69, 3555–3570.
Ye, H., Luo, L., Zhou, Z., & Zhang, T. (2023). Multi-consensus decentralized accelerated gradient descent. Journal of Machine Learning Research, 24(306), 1–50.
Ye, X.-J. (2015). Distributed and consensus optimization for non-smooth image reconstruction. Journal of the Operations Research Society of China, 3(2), 117–138.
Ye, H., Zhou, Z., Luo, L., & Zhang, T. (2020). Decentralized accelerated proximal gradient descent. Advances in Neural Information Processing Systems, 33, 18308–18317.
Yi, J.-W., Chai, L. & Zhang, J. (2023). Convergence rate of accelerated average consensus with local node memory: Optimization and analytic solutions. IEEE Transactions on Automatic Control.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67.
Yuan, K., Ling, Q., & Yin, W. (2016). On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3), 1835–1854.
Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N. & Khazaeni, Y. (2019). Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, (pp. 7252–7261). PMLR
Zhang, X., Liu, J. & Zhu, Z. (2022). Learning coefficient heterogeneity over networks: A distributed spanning-tree-based fused-lasso regression. Journal of the American Statistical Association (just-accepted), 1–29.
Zhang, Y., Qiu, M. & Gao, H. (2023). Communication-efficient stochastic gradient descent ascent with momentum algorithms. In International joint conference on artificial intelligence.
Zhu, M., Shen, L., Du, B., & Tao, D. (2024). Stability and generalization of the decentralized stochastic gradient descent ascent algorithm. Advances in Neural Information Processing Systems, 36.

Funding

Weidong Liu and Xiaojun Mao are the co-corresponding authors. Weidong Liu’s research is supported by NSFC Grant No. 11825104, Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102). Xiaojun Mao’s research is supported by NSFC Grant No. 12371273, Shanghai Rising-Star Program 23QA1404600 and Young Elite Scientists Sponsorship Program by CAST (2023QNRC001).

Author information

Authors and Affiliations

School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
Kejie Tang, Weidong Liu & Xiaojun Mao
MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, 200240, China
Weidong Liu
Ministry of Education Key Laboratory of Scientific and Engineering Computing, Shanghai Jiao Tong University, Shanghai, 200240, China
Xiaojun Mao

Authors

Kejie Tang
Weidong Liu
Xiaojun Mao

Contributions

KT developed the theory, performed the computations, wrote the original draft, and edited the writing. WL and XM developed the theory, conceived the presented idea, verified the analytical methods, supervised the findings of this work, and reviewed and edited the writing. All authors discussed the results and contributed to the final manuscript.

Corresponding authors

Correspondence to Weidong Liu or Xiaojun Mao.

Ethics declarations

Conflict of interest

There is no conflict of interest.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editor: Bo Han.

Appendices

Appendix A: Proof of Theorem 5

We first introduce the concept of a nonexpansive operator, which plays an important role in our proof.

Definition 2

(Nonexpansive operators (Rudin et al., 1992)) An operator $T:{\mathbb {R}}^{p}\rightarrow {\mathbb {R}}^{p}$ is nonexpansive if and only if it satisfies $\Vert T{\textbf{x}}-T{\textbf{y}}\Vert _2\leqslant \Vert {\textbf{x}}-{\textbf{y}}\Vert _2$ for all ${\textbf{x}},{\textbf{y}}\in {\mathbb {R}}^{p}$.

Lemma 6

(Lemma 2.4 of Combettes and Wajs (2005)) Let h be a function in $\Gamma _{0}({\mathbb {R}}^{p})$. Then, $\text {prox}_{h}$ and ${\textbf{I}}_p-\text {prox}_h$ are both firmly nonexpansive operators.
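As a quick numerical illustration of Lemma 6 (not part of the original proof), the sketch below checks firm nonexpansiveness, $\langle T{\textbf{x}}-T{\textbf{y}},{\textbf{x}}-{\textbf{y}}\rangle \geqslant \Vert T{\textbf{x}}-T{\textbf{y}}\Vert _2^2$, on random pairs for one specific choice: $h=\tau \Vert \cdot \Vert _1$, whose proximity operator is coordinate-wise soft-thresholding. The choice of $h$, the value of `tau`, and all function names are assumptions made only for this example.

```python
import random

def soft_threshold(x, tau):
    """Proximity operator of tau * ||.||_1, applied coordinate-wise."""
    return [max(abs(xi) - tau, 0.0) * (1.0 if xi > 0 else -1.0) for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def firmly_nonexpansive_gap(T, x, y):
    """<Tx - Ty, x - y> - ||Tx - Ty||^2; nonnegative when T is firmly nonexpansive."""
    d = [xi - yi for xi, yi in zip(x, y)]
    Td = [a - b for a, b in zip(T(x), T(y))]
    return dot(Td, d) - dot(Td, Td)

random.seed(0)
tau = 0.7  # illustrative threshold, not a value from the paper

def prox(x):
    return soft_threshold(x, tau)

def resid(x):
    # The complementary operator I - prox_h.
    return [xi - pi for xi, pi in zip(x, prox(x))]

gaps = []
for _ in range(1000):
    x = [random.uniform(-3, 3) for _ in range(5)]
    y = [random.uniform(-3, 3) for _ in range(5)]
    gaps.append(firmly_nonexpansive_gap(prox, x, y))
    gaps.append(firmly_nonexpansive_gap(resid, x, y))

min_gap = min(gaps)  # should be nonnegative up to rounding error
```

A random check of this kind cannot prove the lemma; it only shows that the inequality is not violated on the sampled pairs.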

The $\text {PDFP}^2\text {O}$ (Chen et al., 2013) algorithm is described as:

$$\begin{aligned} \left\{ \begin{array}{ll} {\textbf{v}}_{t+1}&{}=T_1\left( {\textbf{v}}_t, {\textbf{x}}_t\right) \\ &{}=\left( {\textbf{I}}_q-{\text {prox}}_{\frac{\gamma }{\lambda } h}\right) \left( {\textbf{B}}\left( {\textbf{x}}_t-\gamma \nabla f\left( {\textbf{x}}_t\right) \right) +\left( {\textbf{I}}_q-\lambda {\textbf{B}}{\textbf{B}}^\top \right) {\textbf{v}}_t\right) ,\\ {\textbf{x}}_{t+1}&{}=T_2\left( {\textbf{v}}_t, {\textbf{x}}_t\right) \\ &{}={\textbf{x}}_t-\gamma \nabla f\left( {\textbf{x}}_t\right) -\lambda {\textbf{B}}^\top \circ T_1\left( {\textbf{v}}_t, {\textbf{x}}_t\right) \\ &{}={\textbf{x}}_t-\gamma \nabla f\left( {\textbf{x}}_t\right) -\lambda {\textbf{B}}^\top {\textbf{v}}_{t+1} . \end{array}\right. \end{aligned}$$

(A1)
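The iteration (A1) can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not the authors' implementation (their demo code is at the repository listed under Code availability). It assumes $h=\mu \Vert \cdot \Vert _1$, so that $\text {prox}_{\frac{\gamma }{\lambda }h}$ is soft-thresholding at level $\gamma \mu /\lambda$, and uses the toy problem $f({\textbf{x}})=\frac{1}{2}\Vert {\textbf{x}}-{\textbf{a}}\Vert _2^2$ with ${\textbf{B}}={\textbf{I}}$, whose minimizer is known in closed form (soft-thresholding of ${\textbf{a}}$ at level $\mu$). The names `pdfp2o`, `matvec`, and the data are hypothetical.

```python
def soft_threshold(w, tau):
    """Proximity operator of tau * ||.||_1, applied coordinate-wise."""
    return [max(abs(wi) - tau, 0.0) * (1.0 if wi > 0 else -1.0) for wi in w]

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def pdfp2o(grad_f, B, mu, gamma, lam, x0, v0, iters):
    """Run the two-step iteration (A1) with h = mu * ||.||_1 (an assumption)."""
    Bt = transpose(B)
    x, v = list(x0), list(v0)
    for _ in range(iters):
        # Gradient step: y = x_t - gamma * grad f(x_t).
        y = [xi - gamma * gi for xi, gi in zip(x, grad_f(x))]
        # w = B y + (I_q - lam * B B^T) v_t.
        BBtv = matvec(B, matvec(Bt, v))
        w = [byi + vi - lam * ci for byi, vi, ci in zip(matvec(B, y), v, BBtv)]
        # v_{t+1} = (I_q - prox_{(gamma/lam) h})(w).
        p = soft_threshold(w, gamma / lam * mu)
        v = [wi - pi for wi, pi in zip(w, p)]
        # x_{t+1} = y - lam * B^T v_{t+1}.
        x = [yi - lam * ci for yi, ci in zip(y, matvec(Bt, v))]
    return v, x

# Toy data: f(x) = 0.5 * ||x - a||^2 (so L = 1) and B = I (so lambda_max = 1).
a = [3.0, -0.5, 1.2]
mu = 1.0
B = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def grad_f(x):
    return [xi - ai for xi, ai in zip(x, a)]

# gamma = 0.5 < 2/L and lam = 0.5 <= 1/lambda_max^2 satisfy the step-size conditions.
v_hat, x_hat = pdfp2o(grad_f, B, mu, gamma=0.5, lam=0.5,
                      x0=[0.0] * 3, v0=[0.0] * 3, iters=200)
x_expected = soft_threshold(a, mu)  # closed-form minimizer of the toy problem
```

On this toy problem `x_hat` should agree with `x_expected` to within floating-point accuracy, which is one simple way to sanity-check an implementation of (A1) before moving to a general ${\textbf{B}}$.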

The aggregation form is defined as follows:

$$\begin{aligned} {\textbf{T}}_j({\textbf{V}}_t,{\textbf{X}}_t)&= \left( T_j({\textbf{v}}_1(t),{\textbf{x}}_1(t)),\cdots , T_j({\textbf{v}}_n(t),{\textbf{x}}_n(t))\right) ^\top , \quad j=1,2, \end{aligned}$$

and their average

$$\begin{aligned} \bar{{\textbf{T}}}_j({\textbf{V}}_t,{\textbf{X}}_t) = \frac{1}{n}\sum _{i=1}^n T_j({\textbf{v}}_i(t),{\textbf{x}}_i(t)),\quad j=1,2. \end{aligned}$$

For ${\textbf{u}}=({\textbf{v}},{\textbf{x}})\in {\mathbb {R}}^q\times {\mathbb {R}}^p$, the $\lambda$-norm is defined as $\Vert {\textbf{u}}\Vert _\lambda = \sqrt{\Vert {\textbf{x}}\Vert _2^2+\lambda \Vert {\textbf{v}}\Vert _2^2}$. Under this norm, the operator pair $(T_1,T_2)$ in (A1), acting jointly on ${\textbf{v}}$ and ${\textbf{x}}$, is nonexpansive:

Corollary 7

(Corollary 3.1 of Chen et al. (2013)) If $0<\gamma<2/L, 0<\lambda \leqslant 1/\lambda _{\max }^2$, then $(T_1,T_2)$ is nonexpansive under the norm $\Vert \cdot \Vert _\lambda$.

Lemma 6 and Corollary 7 give that ${\textbf{T}}_1({\textbf{V}},{\textbf{X}})= \left( T_1({\textbf{v}}_1,{\textbf{x}}_1),\cdots , T_1({\textbf{v}}_n,{\textbf{x}}_n)\right) ^\top$ is also nonexpansive. With initialization ${\textbf{g}}_i(0)=\nabla f_i({\textbf{x}}_i(0))$, for all t, it holds that

$$\begin{aligned} \bar{{\textbf{g}}}_t&= \nabla {\bar{F}}({\textbf{X}}_{t}). \end{aligned}$$

(A2)
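The invariant (A2) is the usual average-preservation property of gradient tracking. The sketch below illustrates it on a generic tracking recursion, assuming a doubly stochastic mixing matrix `W`, a single mixing round per iteration, scalar local variables, and quadratic local losses $f_i(x)=\frac{1}{2}(x-a_i)^2$. None of these specifics are taken from Algorithm 1, which uses multiple consensus rounds and the primal-dual update; only the tracking step is imitated here.

```python
def mix(W, z):
    """One consensus round: multiply the stacked local values by W."""
    return [sum(wij * zj for wij, zj in zip(row, z)) for row in W]

def mean(z):
    return sum(z) / len(z)

# Hypothetical local data: f_i(x) = 0.5 * (x - a_i)^2, so grad f_i(x) = x - a_i.
a = [1.0, -2.0, 4.0]

def local_grads(x):
    return [xi - ai for xi, ai in zip(x, a)]

# A symmetric doubly stochastic mixing matrix (rows and columns sum to 1).
W = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]

gamma = 0.3
x = [0.0, 5.0, -1.0]
g = local_grads(x)  # initialization g_i(0) = grad f_i(x_i(0))

max_gap = 0.0
for _ in range(50):
    x_new = mix(W, [xi - gamma * gi for xi, gi in zip(x, g)])
    # Tracking step: add the change in local gradients, then mix.
    g = mix(W, [gi + gn - go
                for gi, gn, go in zip(g, local_grads(x_new), local_grads(x))])
    x = x_new
    # The average tracker should equal the average of the local gradients.
    max_gap = max(max_gap, abs(mean(g) - mean(local_grads(x))))
```

Because the columns of `W` sum to one, mixing leaves the average unchanged, so the equality holds by induction from the initialization; `max_gap` should therefore stay at the level of rounding error.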

The iterations of ${\textbf{X}}_t, {\textbf{V}}_t$ in Algorithm 1 indicate that:

$$\begin{aligned} \bar{{\textbf{v}}}_{t+1}&= \bar{{\textbf{T}}}_1({\textbf{V}}_t, {\textbf{X}}_t) = T_1(\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t) + \left( \bar{{\textbf{T}}}_1({\textbf{V}}_t, {\textbf{X}}_t) - T_1(\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t)\right) \triangleq T_1(\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t) + \varvec{\epsilon }_1^t,\\ \bar{{\textbf{x}}}_{t+1}&=\bar{{\textbf{x}}}_t -\gamma \nabla {\bar{F}}({\textbf{X}}_t) - \lambda {\textbf{B}}^\top \bar{{\textbf{v}}}_{t+1},\\&= \left( \bar{{\textbf{x}}}_t -\gamma \nabla f(\bar{{\textbf{x}}}_t) - \lambda {\textbf{B}}^\top T_1(\bar{{\textbf{v}}}_t, \bar{{\textbf{x}}}_t)\right) + \gamma \left( \nabla f(\bar{{\textbf{x}}}_t)-\nabla {\bar{F}}({\textbf{X}}_t)\right) + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t ,\\&\triangleq T_2(\bar{{\textbf{v}}}_t, \bar{{\textbf{x}}}_t) + \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t. \end{aligned}$$

Therefore, for the average $\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t$, their iterations are

$$\begin{aligned} \left\{ \begin{array}{l} \bar{{\textbf{v}}}_{t+1}= T_1(\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t) + \varvec{\epsilon }_1^t,\\ \bar{{\textbf{x}}}_{t+1}= T_2(\bar{{\textbf{v}}}_t, \bar{{\textbf{x}}}_t) + \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t. \end{array} \right. \end{aligned}$$

(A3)

As shown in Chen et al. (2013), a solution to problem (1) corresponds to a fixed point of the $\text {PDFP}^2\text {O}$ iteration; the main result is stated in the following proposition.

Proposition 8

(Theorem 3.1 of Chen et al. (2013)) Let $\lambda$ and $\gamma$ be positive numbers. Suppose ${\textbf{x}}^*$ is a solution to the problem (1). Then, there exists ${\textbf{v}}^* \in {\mathbb {R}}^q$ such that

$$\begin{aligned} \left\{ \begin{array}{l} {\textbf{v}}^*=T_1\left( {\textbf{v}}^*, {\textbf{x}}^*\right) , \\ {\textbf{x}}^*=T_2\left( {\textbf{v}}^*, {\textbf{x}}^*\right) . \end{array}\right. \end{aligned}$$

In other words, ${\textbf{u}}^*=\left( {\textbf{v}}^*, {\textbf{x}}^*\right)$ is a fixed point of $(T_1,T_2)$. Conversely, if ${\textbf{u}}^* \in {\mathbb {R}}^q \times {\mathbb {R}}^p$ is a fixed point of $(T_1,T_2)$, then ${\textbf{u}}^*=\left( {\textbf{v}}^*, {\textbf{x}}^*\right) , {\textbf{v}}^* \in {\mathbb {R}}^q, {\textbf{x}}^* \in {\mathbb {R}}^p$, and ${\textbf{x}}^*$ is a solution to problem (1).

The main part of the convergence guarantee is controlling the error terms $\varvec{\epsilon }_1^t$ and $\varvec{\epsilon }_2^t$.

Lemma 9

Suppose that $0<\gamma<2/L, 0<\lambda \leqslant 1/\lambda _{\max }^2$. For the error terms $\varvec{\epsilon }_j^t, j=1,2$ in iteration (A3), it holds that:

$$\begin{aligned} \left\| \varvec{\epsilon }_1^t\right\| _2 \leqslant \frac{\lambda _{\max }(3+2/L)+1}{\sqrt{n}}\left\| {\textbf{z}}_t\right\| _\infty ,\quad \quad \text {and}\quad \quad \left\| \varvec{\epsilon }_2^t\right\| _2 \leqslant \frac{L}{\sqrt{n}} \left\| {\textbf{z}}_t\right\| _\infty , \end{aligned}$$

where ${\textbf{z}}_t = [\Vert {\textbf{V}}_t-{\textbf{1}}\bar{{\textbf{v}}}_t^\top \Vert _2, \Vert {\textbf{X}}_t-{\textbf{1}}\bar{{\textbf{x}}}_t^\top \Vert _2, \Vert {\textbf{G}}_t- {\textbf{1}}\bar{{\textbf{g}}}_t^\top \Vert _2]^\top$.

Lemma 9 shows that when ${\textbf{V}}_t, {\textbf{X}}_t, {\textbf{G}}_t$ are close enough to their means $\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t,\bar{{\textbf{g}}}_t$, the iteration of $\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t$ can be viewed as $\text {PDFP}^2\text {O}$ plus small error terms. We next quantify how the difference between ${\textbf{V}}_t, {\textbf{X}}_t, {\textbf{G}}_t$ and their means $\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t,\bar{{\textbf{g}}}_t$ evolves as the iteration count t and the number of communication rounds K increase.

Lemma 10

Suppose $0<\gamma<2/L, 0<\lambda \leqslant 1/\lambda _{\max }^2$. Then, the t-th iteration of Algorithm 1 satisfies

$$\begin{aligned} {\textbf{z}}_{t+1}&\leqslant \chi ^K {\varvec{A}}{\textbf{z}}_t + \chi ^K L\sqrt{n}\left[ 0,0, \gamma L \left\| \bar{{\textbf{x}}}_t-{\textbf{x}}^*\right\| _2+\lambda \lambda _{\max }\left\| \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\right\| _2\right] ^\top , \end{aligned}$$

where ${\textbf{z}}_t = [\Vert {\textbf{V}}_t-{\textbf{1}}\bar{{\textbf{v}}}_t^\top \Vert _2,\Vert {\textbf{X}}_t-{\textbf{1}}\bar{{\textbf{x}}}_t^\top \Vert _2,\Vert {\textbf{G}}_t- {\textbf{1}}\bar{{\textbf{g}}}_t^\top \Vert _2]^\top$ and

$$\begin{aligned} {\varvec{A}}= \left( \begin{array}{ccc} 2&{} 2 \lambda _{\max } &{} 2 \gamma \lambda _{\max } \\ 2\lambda \lambda _{\max } &{} 3 &{} 3 \gamma \\ 4L \lambda \lambda _{\max } &{} 6L &{} 7 \end{array} \right) . \end{aligned}$$

Then $\Vert {\varvec{A}}\Vert _\infty \leqslant 7+6\,L+ 2\lambda _{\max } + 3\gamma (1+\lambda _{\max }) + 4\lambda \lambda _{\max } (1+L)$. Therefore, with

$$\begin{aligned} K_t = \frac{\log (1/\alpha _t)+\log \left( 7+6L+ 2\lambda _{\max } + 3\gamma (1+\lambda _{\max }) + 4\lambda \lambda _{\max } (1+L)\right) }{\log \left( 1/\chi \right) }, \end{aligned}$$

for any $\alpha _t>0$, we have

$$\begin{aligned} \Vert {\textbf{z}}_{t+1}\Vert _\infty&\leqslant \alpha _t\Vert {\textbf{z}}_t\Vert _\infty + \alpha _t \sqrt{n}\left( \gamma L \Vert \bar{{\textbf{x}}}_t-{\textbf{x}}^*\Vert _2+ \lambda \lambda _{\max }\Vert \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\Vert _2\right) . \end{aligned}$$
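The choice of $K_t$ above is easy to evaluate numerically. The sketch below computes $\Vert {\varvec{A}}\Vert _\infty$ directly from the matrix in Lemma 10, the closed-form upper bound, and the resulting number of consensus rounds. It rounds $K_t$ up to an integer with `math.ceil`, which is an added convention, and treats $\chi \in (0,1)$ as a given per-round contraction factor. The function names and all numerical values are illustrative assumptions.

```python
import math

def a_inf_norm(L, lam_max, gamma, lam):
    """Maximum absolute row sum of the matrix A in Lemma 10."""
    A = [[2.0, 2 * lam_max, 2 * gamma * lam_max],
         [2 * lam * lam_max, 3.0, 3 * gamma],
         [4 * L * lam * lam_max, 6 * L, 7.0]]
    return max(sum(abs(entry) for entry in row) for row in A)

def a_inf_bound(L, lam_max, gamma, lam):
    """Closed-form upper bound on ||A||_inf stated after Lemma 10."""
    return (7 + 6 * L + 2 * lam_max
            + 3 * gamma * (1 + lam_max) + 4 * lam * lam_max * (1 + L))

def consensus_rounds(alpha, chi, L, lam_max, gamma, lam):
    """Smallest integer K with chi**K * bound <= alpha (K_t rounded up)."""
    bound = a_inf_bound(L, lam_max, gamma, lam)
    return math.ceil((math.log(1 / alpha) + math.log(bound)) / math.log(1 / chi))

# Illustrative values satisfying gamma < 2/L and lam <= 1/lam_max**2.
L, lam_max, gamma, lam = 2.0, 1.5, 0.5, 0.4
chi, alpha = 0.5, 0.1

norm_A = a_inf_norm(L, lam_max, gamma, lam)
bound_A = a_inf_bound(L, lam_max, gamma, lam)
K = consensus_rounds(alpha, chi, L, lam_max, gamma, lam)
```

Since $K_t$ depends on $\alpha _t$ only through $\log (1/\alpha _t)$, tightening the target contraction by a constant factor costs only a constant number of extra communication rounds per iteration.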

After controlling for the error terms, we can obtain a convergence analysis of our algorithm. As in Chen et al. (2013), for $0<\gamma< 2/L, 0<\lambda \leqslant 1/\lambda _{\max }^2$, we denote

$$\begin{aligned} g({\textbf{x}})={\textbf{x}}-\gamma \nabla f({\textbf{x}}),\quad \quad \text {and}\quad \quad {\textbf{M}}={\textbf{I}}_q-\lambda {\textbf{B}}{\textbf{B}}^\top . \end{aligned}$$

The ${\textbf{M}}$ semi-norm of a vector ${\textbf{v}}$ is defined as $\Vert {\textbf{v}}\Vert _{\textbf{M}}= \sqrt{\langle {\textbf{v}},{\textbf{M}}{\textbf{v}}\rangle }$. Recall that $\Vert {\textbf{u}}\Vert _\lambda = \sqrt{\Vert {\textbf{x}}\Vert _2^2+\lambda \Vert {\textbf{v}}\Vert _2^2}$. Then, the iteration of Algorithm 1 can be bounded by the following theorem.

Theorem 11

For any two elements $\bar{{\textbf{u}}}_1 = (\bar{{\textbf{v}}}_1,\bar{{\textbf{x}}}_1),\bar{{\textbf{u}}}_2 = (\bar{{\textbf{v}}}_2,\bar{{\textbf{x}}}_2)\in {\mathbb {R}}^q\times {\mathbb {R}}^p$, we have that

$$\begin{aligned}&\Vert \left( T_1(\bar{{\textbf{u}}}_1)+\varvec{\epsilon }_1, T_2(\bar{{\textbf{u}}}_1)+\gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1\right) - \left( T_1(\bar{{\textbf{u}}}_2), T_2(\bar{{\textbf{u}}}_2)\right) \Vert _\lambda ^2 \nonumber \\&\leqslant \Vert \left( T_1(\bar{{\textbf{u}}}_1), T_2(\bar{{\textbf{u}}}_1)\right) - \left( T_1(\bar{{\textbf{u}}}_2), T_2(\bar{{\textbf{u}}}_2)\right) \Vert _\lambda ^2 \nonumber \\&+ 2\Vert \gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1\Vert _2\Vert T_2(\bar{{\textbf{u}}}_1)+\gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1 - T_2(\bar{{\textbf{u}}}_2)\Vert _2-\Vert \gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1\Vert _2^2\nonumber \\&+ 2\lambda \Vert \varvec{\epsilon }_1\Vert \Vert T_1(\bar{{\textbf{u}}}_1)+\varvec{\epsilon }_1-T_1(\bar{{\textbf{u}}}_2)\Vert - \lambda \Vert \varvec{\epsilon }_1\Vert ^2. \end{aligned}$$

(A4)
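One way to see where (A4) comes from is the following elementary identity, sketched here as a reading aid rather than as the authors' argument. For any vectors ${\textbf{a}}$ and ${\textbf{e}}$ in a Euclidean space,

$$\begin{aligned} \Vert {\textbf{a}}+{\textbf{e}}\Vert _2^2 = \Vert {\textbf{a}}\Vert _2^2 + 2\langle {\textbf{e}},{\textbf{a}}+{\textbf{e}}\rangle - \Vert {\textbf{e}}\Vert _2^2 \leqslant \Vert {\textbf{a}}\Vert _2^2 + 2\Vert {\textbf{e}}\Vert _2\Vert {\textbf{a}}+{\textbf{e}}\Vert _2 - \Vert {\textbf{e}}\Vert _2^2 . \end{aligned}$$

Applying it to the ${\textbf{x}}$-block with ${\textbf{a}}=T_2(\bar{{\textbf{u}}}_1)-T_2(\bar{{\textbf{u}}}_2)$ and ${\textbf{e}}=\gamma \varvec{\epsilon }_2+\lambda {\textbf{B}}^\top \varvec{\epsilon }_1$, to the ${\textbf{v}}$-block with ${\textbf{a}}=T_1(\bar{{\textbf{u}}}_1)-T_1(\bar{{\textbf{u}}}_2)$ and ${\textbf{e}}=\varvec{\epsilon }_1$, and adding the first inequality to $\lambda$ times the second reproduces the right-hand side of (A4), because $\Vert ({\textbf{v}},{\textbf{x}})\Vert _\lambda ^2=\Vert {\textbf{x}}\Vert _2^2+\lambda \Vert {\textbf{v}}\Vert _2^2$.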

On the right-hand side, the upper bound of $\Vert \left( T_1(\bar{{\textbf{u}}}_1), T_2(\bar{{\textbf{u}}}_1)\right) - \left( T_1(\bar{{\textbf{u}}}_2), T_2(\bar{{\textbf{u}}}_2)\right) \Vert _\lambda ^2$ can be obtained by analyzing $\text {PDFP}^2\text {O}$ in Proposition 12. Combined with the previous analysis, we can obtain the algorithm’s consistency without requiring strong convexity.

Proof of Theorem 5

We first prove that $(\bar{{\textbf{v}}}_t, \bar{{\textbf{x}}}_t,\bar{{\textbf{g}}}_t)$ converges to $({\textbf{v}}^*,{\textbf{x}}^*,{\textbf{g}}^*)$ and that ${\textbf{x}}^*$ is a solution to problem (1). We then prove $({\textbf{v}}_i(t), {\textbf{x}}_i(t),{\textbf{g}}_i(t))$ converges to $(\bar{{\textbf{v}}}_t, \bar{{\textbf{x}}}_t,\bar{{\textbf{g}}}_t)$ for all $i\in {\mathcal {V}}.$

Let ${\textbf{u}}^*=({\textbf{v}}^*,{\textbf{x}}^*)$ be a fixed point of $(T_1,T_2)$. Combining inequalities (A4) and (A17), we have that

$$\begin{aligned} \Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda ^2&\leqslant \Vert \bar{{\textbf{u}}}_t-{\textbf{u}}^*\Vert _\lambda ^2 \nonumber \\&- \gamma \left( \frac{2}{L}-\gamma \right) \Vert \nabla f(\bar{{\textbf{x}}}_t)-\nabla f({\textbf{x}}^*)\Vert _2^2 - \Vert \lambda {\textbf{B}}^\top (\bar{{\textbf{v}}}_t-{\textbf{v}}^*)\Vert _2^2 - \lambda \Vert T_1(\bar{{\textbf{u}}}_t)-\bar{{\textbf{v}}}_t\Vert _{\textbf{M}}^2\nonumber \\&+ 2\Vert \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t\Vert _2\Vert \bar{{\textbf{x}}}_{t+1} - {\textbf{x}}^*\Vert _2-\Vert \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t\Vert _2^2\nonumber \\&+ 2\lambda \Vert \varvec{\epsilon }_1^t\Vert _2 \Vert \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\Vert _2- \lambda \Vert \varvec{\epsilon }_1^t\Vert _2^2\nonumber \\&\leqslant \Vert \bar{{\textbf{u}}}_t-{\textbf{u}}^*\Vert _\lambda ^2 \nonumber \\&- \gamma \left( \frac{2}{L}-\gamma \right) \Vert \nabla f(\bar{{\textbf{x}}}_t)-\nabla f({\textbf{x}}^*)\Vert _2^2 - \Vert \lambda {\textbf{B}}^\top (\bar{{\textbf{v}}}_t-{\textbf{v}}^*)\Vert _2^2 - \lambda \Vert T_1(\bar{{\textbf{u}}}_t)-\bar{{\textbf{v}}}_t\Vert _{\textbf{M}}^2\nonumber \\&+ 2\sqrt{\Vert \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t\Vert _2^2+\lambda \Vert \varvec{\epsilon }_1^t\Vert ^2_2}\Vert \bar{{\textbf{u}}}_{t+1} - {\textbf{u}}^*\Vert _\lambda -\Vert \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t\Vert _2^2- \lambda \Vert \varvec{\epsilon }_1^t\Vert _2^2, \end{aligned}$$

(A5)

where the last inequality follows from the Cauchy–Schwarz inequality.
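To spell out this step (a sketch, using the shorthand $a_t=\Vert \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t\Vert _2$ and $b_t=\Vert \varvec{\epsilon }_1^t\Vert _2$, which is introduced only here), the two cross terms are combined by applying Cauchy–Schwarz in ${\mathbb {R}}^2$ to the vectors $(a_t,\sqrt{\lambda }\,b_t)$ and $(\Vert \bar{{\textbf{x}}}_{t+1}-{\textbf{x}}^*\Vert _2,\sqrt{\lambda }\,\Vert \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\Vert _2)$:

$$\begin{aligned} 2a_t\Vert \bar{{\textbf{x}}}_{t+1}-{\textbf{x}}^*\Vert _2 + 2\lambda b_t\Vert \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\Vert _2&\leqslant 2\sqrt{a_t^2+\lambda b_t^2}\,\sqrt{\Vert \bar{{\textbf{x}}}_{t+1}-{\textbf{x}}^*\Vert _2^2+\lambda \Vert \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\Vert _2^2}\\&= 2\sqrt{a_t^2+\lambda b_t^2}\,\Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda . \end{aligned}$$

The second factor is exactly the $\lambda$-norm of $\bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*$, which is why the combined cross term is measured in $\Vert \cdot \Vert _\lambda$.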

We define $\xi _t^2 = \gamma \left( \frac{2}{L}-\gamma \right) \Vert \nabla f(\bar{{\textbf{x}}}_t)-\nabla f({\textbf{x}}^*)\Vert _2^2+ \Vert \lambda {\textbf{B}}^\top (\bar{{\textbf{v}}}_t-{\textbf{v}}^*)\Vert _2^2 + \lambda \Vert T_1(\bar{{\textbf{u}}}_t)-\bar{{\textbf{v}}}_t\Vert _{\textbf{M}}^2$ and

$$\begin{aligned} \eta _t = \sqrt{\Vert \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t\Vert _2^2+\lambda \Vert \varvec{\epsilon }_1^t\Vert _2^2}. \end{aligned}$$

(A6)

Then, by Lemma 9, under the conditions that $\gamma <2/L, \lambda \leqslant 1/\lambda _{\max }^2$, it holds that

$$\begin{aligned} \eta _t&\leqslant \left( \lambda \lambda _{\max }+\sqrt{\lambda }\right) \Vert \varvec{\epsilon }_1^t\Vert _2+ \gamma \Vert \varvec{\epsilon }_2^t\Vert _2\nonumber \\&\leqslant \frac{2}{\lambda _{\max }} \frac{\lambda _{\max }(3+2/L)+1}{\sqrt{n}}\Vert {\textbf{z}}_t\Vert _\infty + \frac{L}{\sqrt{n}}\Vert {\textbf{z}}_t\Vert _\infty \nonumber \\&\leqslant \frac{C_1}{\sqrt{n}} \Vert {\textbf{z}}_t\Vert _\infty , \end{aligned}$$

(A7)

where $C_1 = 6+2/\lambda _{\max }+L+4/L$.

From the inequality (A5), we have

$$\begin{aligned} \Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda ^2&\leqslant \Vert \bar{{\textbf{u}}}_t-{\textbf{u}}^*\Vert _\lambda ^2 +2\eta _t \Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda - \eta _t^2 - \xi _t^2, \end{aligned}$$

(A8)

which implies

$$\begin{aligned}&\Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda \leqslant \Vert \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\Vert _\lambda +\eta _t\leqslant \Vert \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\Vert _\lambda + \sum _{s=0}^t \eta _s. \end{aligned}$$

(A9)
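One way to verify the passage from (A8) to (A9) is to complete the square. The following is a sketch of that standard argument; it uses only $\xi _t^2\geqslant 0$, which holds under $\gamma <2/L$ and $\lambda \leqslant 1/\lambda _{\max }^2$:

$$\begin{aligned} \left( \Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda -\eta _t\right) ^2&= \Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda ^2 - 2\eta _t\Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda + \eta _t^2 \\&\leqslant \Vert \bar{{\textbf{u}}}_t-{\textbf{u}}^*\Vert _\lambda ^2 - \xi _t^2 \leqslant \Vert \bar{{\textbf{u}}}_t-{\textbf{u}}^*\Vert _\lambda ^2 . \end{aligned}$$

Taking square roots gives $\Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda -\eta _t\leqslant \Vert \bar{{\textbf{u}}}_t-{\textbf{u}}^*\Vert _\lambda$, and unrolling this recursion down to $t=0$ gives the second inequality in (A9).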

Denote $S_t= \sum _{s=0}^{t} C_1/\sqrt{n} \Vert {\textbf{z}}_s\Vert _\infty +\Vert \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\Vert _\lambda$. Then, by (A7) and (A9), the difference $\Vert \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\Vert _\lambda$ satisfies $\Vert \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\Vert _\lambda \leqslant S_{t-1}$.

To bound the difference between nodes $\Vert {\textbf{z}}_s\Vert _\infty$ in $S_t$, according to Lemma 10, we have

$$\begin{aligned} \Vert {\textbf{z}}_{t+1}\Vert _\infty&\leqslant \alpha _t\Vert {\textbf{z}}_t\Vert _\infty + \alpha _t \sqrt{n} \left( \gamma L\Vert \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\Vert _\lambda + \sqrt{\lambda }\lambda _{\max }\Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda \right) \nonumber \\&\leqslant \alpha _t\Vert {\textbf{z}}_t\Vert _\infty + \alpha _t \sqrt{n} \left( \gamma L+ \sqrt{\lambda }\lambda _{\max }\right) S_t. \end{aligned}$$

(A10)

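Define $C_\xi = C_1\left( \gamma L+ \sqrt{\lambda }\lambda _{\max }\right)$. Multiplying inequality (A10) by $C_1/\sqrt{n}$ and noting that $S_{t+1}-S_t = C_1/\sqrt{n}\, \Vert {\textbf{z}}_{t+1}\Vert _\infty$, inequality (A10) can be rewritten as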

$$\begin{aligned} S_{t+1}-S_t \leqslant \alpha _t (S_t-S_{t-1}) + \alpha _t C_\xi S_t, \end{aligned}$$

(A11)

where $\alpha _t$ depends on $K_t$ and can be made arbitrarily small. Take $K_t$ to be

$$\begin{aligned} K_t = \frac{\log ((t+1)^2/\delta )+\log \left( 7+6L+ 2\lambda _{\max } + 3\gamma (1+\lambda _{\max }) + 4\lambda \lambda _{\max } (1+L)\right) }{\log \left( 1/\chi \right) }, \end{aligned}$$

for any $\delta >0$, which means $\alpha _t = \delta /(t+1)^2$ in the inequality (A11) by Lemma 10.

Combining inequalities (A7) and (A11), we have that

$$\begin{aligned} S_{t+1}\leqslant (1+\alpha _t(1+C_\xi )) S_{t} \leqslant S_0 \prod _{s=0}^t \left( 1+ \frac{\delta (1+C_\xi )}{(s+1)^2}\right) \leqslant C_S<\infty . \end{aligned}$$
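The finiteness of $C_S$ can be checked with the elementary bound $1+x\leqslant e^x$ for $x\geqslant 0$ together with $\sum _{s=0}^\infty (s+1)^{-2}=\pi ^2/6$. Under these standard facts, one admissible (not necessarily sharpest) choice of the constant is obtained as follows:

$$\begin{aligned} \prod _{s=0}^t \left( 1+ \frac{\delta (1+C_\xi )}{(s+1)^2}\right) \leqslant \exp \left( \delta (1+C_\xi )\sum _{s=0}^\infty \frac{1}{(s+1)^2}\right) = \exp \left( \frac{\pi ^2\delta (1+C_\xi )}{6}\right) , \end{aligned}$$

so that $C_S$ may be taken as $S_0\exp \left( \pi ^2\delta (1+C_\xi )/6\right)$.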

So $S_t= \sum _{s=0}^{t} C_1/\sqrt{n} \Vert {\textbf{z}}_s\Vert _\infty +\Vert \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\Vert _\lambda$ is bounded by some constant $C_S$ and thus $\Vert \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\Vert _\lambda \leqslant S_{t-1}\leqslant C_S$.

Summing the inequality (A8) over t from zero to infinity, we obtain

$$\begin{aligned} \sum _{t=0}^\infty \left( \eta _t^2+\xi _t^2\right)&\leqslant \Vert \bar{{\textbf{u}}}_0-{\textbf{u}}^*\Vert _\lambda ^2 +2 \sum _{t=0}^\infty \eta _t \Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda \leqslant \Vert \bar{{\textbf{u}}}_0-{\textbf{u}}^*\Vert _\lambda ^2 + 2 C_S^2<\infty , \end{aligned}$$

which results in $\eta _t\rightarrow 0$ and $\xi _t \rightarrow 0$. It implies that

$$\begin{aligned} \Vert \nabla f(\bar{{\textbf{x}}}_t)-\nabla f({\textbf{x}}^*)\Vert _2&\rightarrow 0, \\ \Vert {\textbf{B}}^\top (\bar{{\textbf{v}}}_t-{\textbf{v}}^*)\Vert _2&\rightarrow 0 ,\\ \Vert T_1(\bar{{\textbf{u}}}_t)-\bar{{\textbf{v}}}_t\Vert _{\textbf{M}}&\rightarrow 0,\\ \varvec{\epsilon }_1^t,\varvec{\epsilon }_2^t&\rightarrow 0 . \end{aligned}$$

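From the definition of the ${\textbf{M}}$ semi-norm, it holds that

$$\begin{aligned} \Vert \bar{{\textbf{v}}}_{t+1}-\bar{{\textbf{v}}}_t\Vert _2^2 = \Vert \bar{{\textbf{v}}}_{t+1}-\bar{{\textbf{v}}}_t\Vert _{\textbf{M}}^2 + \lambda \Vert {\textbf{B}}^\top \left( \bar{{\textbf{v}}}_{t+1}-\bar{{\textbf{v}}}_t\right) \Vert _2^2. \end{aligned}$$

Taking square roots, using $\sqrt{a^2+b^2}\leqslant a+b$ for $a,b\geqslant 0$, and recalling that $\bar{{\textbf{v}}}_{t+1}=T_1(\bar{{\textbf{u}}}_t)+\varvec{\epsilon }_1^t$, we obtain

$$\begin{aligned} \Vert \bar{{\textbf{v}}}_{t+1}-\bar{{\textbf{v}}}_t\Vert _2&\leqslant \Vert \bar{{\textbf{v}}}_{t+1}-\bar{{\textbf{v}}}_t\Vert _{\textbf{M}} + \sqrt{\lambda } \Vert {\textbf{B}}^\top \left( \bar{{\textbf{v}}}_{t+1}-\bar{{\textbf{v}}}_t\right) \Vert _2\\&\leqslant \Vert T_1(\bar{{\textbf{u}}}_t)-\bar{{\textbf{v}}}_t\Vert _{\textbf{M}}+ \Vert \varvec{\epsilon }_1^t\Vert _{\textbf{M}}+ \sqrt{\lambda } \Vert {\textbf{B}}^\top \left( \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\right) \Vert _2 + \sqrt{\lambda } \Vert {\textbf{B}}^\top \left( \bar{{\textbf{v}}}_{t}-{\textbf{v}}^*\right) \Vert _2\rightarrow 0. \end{aligned}$$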

According to Proposition 8, the fixed point $({\textbf{v}}^*,{\textbf{x}}^*)$ satisfies

$$\begin{aligned} {\textbf{x}}^* = T_2({\textbf{v}}^*,{\textbf{x}}^*) = {\textbf{x}}^* - \gamma \nabla f({\textbf{x}}^*) - \lambda {\textbf{B}}^\top {\textbf{v}}^* , \end{aligned}$$

which implies $- \gamma \nabla f({\textbf{x}}^*) - \lambda {\textbf{B}}^\top {\textbf{v}}^* = 0$. From the iteration (A3) of $\bar{{\textbf{x}}}_t$, we have that

$$\begin{aligned} \bar{{\textbf{x}}}_{t+1} - \bar{{\textbf{x}}}_t&= - \gamma \left( \nabla f(\bar{{\textbf{x}}}_t)-\nabla f({\textbf{x}}^*)\right) - \lambda {\textbf{B}}^\top \left( \bar{{\textbf{v}}}_t-{\textbf{v}}^*\right) + \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t, \\ \Vert \bar{{\textbf{x}}}_{t+1} - \bar{{\textbf{x}}}_t\Vert _2&\leqslant \gamma \Vert \nabla f(\bar{{\textbf{x}}}_t)-\nabla f({\textbf{x}}^*)\Vert _2+ \lambda \Vert {\textbf{B}}^\top \left( \bar{{\textbf{v}}}_t-{\textbf{v}}^*\right) \Vert _2 + \gamma \Vert \varvec{\epsilon }_2^t\Vert _2 + \lambda \Vert {\textbf{B}}^\top \varvec{\epsilon }_1^t\Vert _2\rightarrow 0. \end{aligned}$$

Hence

$$\begin{aligned} \Vert \bar{{\textbf{u}}}_{t+1}-\bar{{\textbf{u}}}_t\Vert _\lambda \rightarrow 0. \end{aligned}$$

(A12)

Because $\Vert \bar{{\textbf{u}}}_t-{\textbf{u}}^*\Vert _\lambda$ is bounded, there exists a convergent subsequence $\{\bar{{\textbf{u}}}_{t_j}\}$ and $\bar{{\textbf{u}}}^*=(\bar{{\textbf{v}}}^*,\bar{{\textbf{x}}}^*)\in {\mathbb {R}}^q\times {\mathbb {R}}^p$ such that

$$\begin{aligned} \lim _{j\rightarrow \infty } \Vert \bar{{\textbf{u}}}_{t_j}-\bar{{\textbf{u}}}^*\Vert _2 = 0. \end{aligned}$$

(A13)

We now prove that $\bar{{\textbf{u}}}^*$ is a fixed point of $(T_1,T_2)$.

$$\begin{aligned}&\Vert \left( T_1(\bar{{\textbf{u}}}^*), T_2(\bar{{\textbf{u}}}^*)\right) - \bar{{\textbf{u}}}^*\Vert _\lambda \\&\leqslant \Vert \left( T_1(\bar{{\textbf{u}}}^*), T_2(\bar{{\textbf{u}}}^*)\right) - \left( T_1(\bar{{\textbf{u}}}_{t_j})+\varvec{\epsilon }_1^{t_j}, T_2(\bar{{\textbf{u}}}_{t_j})+\gamma \varvec{\epsilon }_2^{t_j} + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^{t_j}\right) \Vert _\lambda + \Vert \bar{{\textbf{u}}}_{t_j+1}-\bar{{\textbf{u}}}^*\Vert _{\lambda }\\&\leqslant \Vert \bar{{\textbf{u}}}^*- \bar{{\textbf{u}}}_{t_j}\Vert _{\lambda } + \Vert (\varvec{\epsilon }_1^{t_j} , \gamma \varvec{\epsilon }_2^{t_j}+\lambda {\textbf{B}}^\top \varvec{\epsilon }_1^{t_j})\Vert _{\lambda } + \Vert \bar{{\textbf{u}}}_{t_j+1}-\bar{{\textbf{u}}}_{t_j}\Vert _{\lambda } + \Vert \bar{{\textbf{u}}}_{t_j}-\bar{{\textbf{u}}}^*\Vert _{\lambda }\rightarrow 0. \end{aligned}$$

The last inequality is due to the nonexpansive property of $(T_1,T_2)$ from Corollary 7.

Hence $\bar{{\textbf{u}}}^*=(\bar{{\textbf{v}}}^*,\bar{{\textbf{x}}}^*)$ is a fixed point of $(T_1,T_2)$. Moreover, note that inequality (A5) holds for every fixed point. Choosing ${\textbf{u}}^* = \bar{{\textbf{u}}}^*$ and applying inequalities (A9) and (A12), we have

$$\begin{aligned} \lim _{t\rightarrow \infty } \Vert \bar{{\textbf{u}}}_t - \bar{{\textbf{u}}}^*\Vert _\lambda&\leqslant \lim _{t\rightarrow \infty } \left( \Vert \bar{{\textbf{u}}}_{t_j}-\bar{{\textbf{u}}}^*\Vert _\lambda + \sum _{s=t_j}^t \eta _s\right) = \Vert \bar{{\textbf{u}}}_{t_j}-\bar{{\textbf{u}}}^*\Vert _\lambda + \sum _{s=t_j}^\infty \eta _s, \quad \forall j. \end{aligned}$$

Let $j\rightarrow \infty$, then $\lim _{t\rightarrow \infty } \Vert \bar{{\textbf{u}}}_t - \bar{{\textbf{u}}}^*\Vert _\lambda =0$, which implies

$$\begin{aligned} \Vert \bar{{\textbf{v}}}_t-\bar{{\textbf{v}}}^*\Vert _2\rightarrow 0,\quad \Vert \bar{{\textbf{x}}}_t-\bar{{\textbf{x}}}^*\Vert _2\rightarrow 0. \end{aligned}$$

For $\bar{{\textbf{g}}}_t$, by iteration (A2), it holds that

$$\begin{aligned} \Vert \bar{{\textbf{g}}}_t-\bar{{\textbf{g}}}^*\Vert _2&= \Vert \nabla {\bar{F}} ({\textbf{X}}_t) - \nabla f (\bar{{\textbf{x}}}^*)\Vert _2 \leqslant \Vert \nabla {\bar{F}} ({\textbf{X}}_t) - \nabla f (\bar{{\textbf{x}}}_t)\Vert _2 + \Vert \nabla f (\bar{{\textbf{x}}}_t) - \nabla f(\bar{{\textbf{x}}}^*)\Vert _2 \nonumber \\&\leqslant \Vert \varvec{\epsilon }_2^t\Vert _2 + L \Vert \bar{{\textbf{x}}}_t-\bar{{\textbf{x}}}^*\Vert _2 \rightarrow 0. \end{aligned}$$

(A14)

From Proposition 8, we can conclude that $\bar{{\textbf{x}}}^*$ is the solution to problem (1).

The remainder of this section demonstrates that $({\textbf{v}}_i(t), {\textbf{x}}_i(t),{\textbf{g}}_i(t))$ converges to $(\bar{{\textbf{v}}}_t, \bar{{\textbf{x}}}_t,\bar{{\textbf{g}}}_t)$ for all $i\in {\mathcal {V}}$. Since $S_t= \sum _{s=0}^{t} C_1/\sqrt{n}\Vert {\textbf{z}}_s\Vert _\infty +\Vert \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\Vert _\lambda$ is nondecreasing in $t$ and bounded, the series $\sum _{s}\Vert {\textbf{z}}_s\Vert _\infty$ converges, so $\Vert {\textbf{z}}_t\Vert _\infty \rightarrow 0$. Consequently,

$$\begin{aligned} \max \left\{ \Vert {\textbf{v}}_i(t)-\bar{{\textbf{v}}}_t\Vert _2, \Vert {\textbf{x}}_i(t)-\bar{{\textbf{x}}}_t\Vert _2,\Vert {\textbf{g}}_i(t)-\bar{{\textbf{g}}}_t\Vert _2\right\} \leqslant \Vert {\textbf{z}}_t\Vert _\infty \rightarrow 0, \end{aligned}$$

for all $i\in {\mathcal {V}}$.

Then for all $i\in {\mathcal {V}}$,

$$\begin{aligned} \Vert {\textbf{v}}_i(t)-{\textbf{v}}^*\Vert _2&\leqslant \Vert {\textbf{v}}_i(t)-\bar{{\textbf{v}}}_t\Vert _2 + \Vert \bar{{\textbf{v}}}_t-\bar{{\textbf{v}}}^*\Vert _2\rightarrow 0,\\ \Vert {\textbf{x}}_i(t)-{\textbf{x}}^*\Vert _2&\leqslant \Vert {\textbf{x}}_i(t)-\bar{{\textbf{x}}}_t\Vert _2 + \Vert \bar{{\textbf{x}}}_t-\bar{{\textbf{x}}}^*\Vert _2\rightarrow 0,\\ \Vert {\textbf{g}}_i(t)-{\textbf{g}}^*\Vert _2&\leqslant \Vert {\textbf{g}}_i(t)-\bar{{\textbf{g}}}_t\Vert _2 + \Vert \bar{{\textbf{g}}}_t-\bar{{\textbf{g}}}^*\Vert _2\rightarrow 0. \end{aligned}$$

Therefore $({\textbf{v}}_i(t), {\textbf{x}}_i(t),{\textbf{g}}_i(t))$ converges to $(\bar{{\textbf{v}}}^*,\bar{{\textbf{x}}}^*,\bar{{\textbf{g}}}^*)$ for all $i\in {\mathcal {V}}$ and $\bar{{\textbf{x}}}^*$ is a solution to problem (1). $\square$

A.1. Proof of Lemma 9

Proof

From the L-smoothness of $f_i$, we have

$$\begin{aligned} \Vert \varvec{\epsilon }_2^t\Vert _2=\Vert \nabla f(\bar{{\textbf{x}}}_t)-\nabla {\bar{F}}({\textbf{X}}_t)\Vert _2 \leqslant \frac{1}{\sqrt{n}}\Vert \nabla F({\textbf{1}}\bar{{\textbf{x}}}_t^\top ) - \nabla F({\textbf{X}}_t)\Vert _2\leqslant \frac{L}{\sqrt{n}}\Vert {\textbf{z}}_t\Vert _\infty . \end{aligned}$$

Define

$$\begin{aligned} {\hat{T}}_1(\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t,\bar{{\textbf{g}}}_t) = \left( {\textbf{I}}_q-{\text {prox}}_{\frac{\gamma }{\lambda } h}\right) \left( {\textbf{B}}\left( \bar{{\textbf{x}}}_t-\gamma \bar{{\textbf{g}}}_t\right) + \left( {\textbf{I}}_q-\lambda {\textbf{B}}{\textbf{B}}^\top \right) \bar{{\textbf{v}}}_t\right) . \end{aligned}$$

(A15)

From Lemma 13 and Corollary 7, it holds that

$$\begin{aligned} \Vert \varvec{\epsilon }_1^t\Vert _2&=\Vert \bar{{\textbf{T}}}_1({\textbf{V}}_t, {\textbf{X}}_t) - {\hat{T}}_1(\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t,\bar{{\textbf{g}}}_t)\Vert _2 \leqslant \frac{1}{\sqrt{n}} \Vert T_1({\textbf{V}}_t,{\textbf{X}}_t)-{\textbf{1}}{\hat{T}}_1(\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t,\bar{{\textbf{g}}}_t)^\top \Vert _2\nonumber \\&\leqslant \frac{1}{\sqrt{n}}\left( \lambda _{\max }\Vert {\textbf{X}}_t-{\textbf{1}}\bar{{\textbf{x}}}_t^\top \Vert _2 + \gamma \lambda _{\max } \Vert {\textbf{G}}_t-{\textbf{1}}\bar{{\textbf{g}}}_t^\top \Vert _2+ \gamma \lambda _{\max } L \Vert {\textbf{X}}_t- {\textbf{1}}\bar{{\textbf{x}}}_t^\top \Vert _2+\Vert {\textbf{V}}_t-{\textbf{1}}\bar{{\textbf{v}}}_t^\top \Vert _2\right) \nonumber \\&\leqslant \frac{\lambda _{\max }+ 2\lambda _{\max }/L + 1+2\lambda _{\max } }{\sqrt{n}} \Vert {\textbf{z}}_t\Vert _\infty \leqslant \frac{\lambda _{\max }(3+2/L)+1}{\sqrt{n}} \Vert {\textbf{z}}_t\Vert _\infty . \end{aligned}$$

(A16)

$\square$

A.2. Proof of Theorem 11

Proposition 12

For any two elements $\bar{{\textbf{u}}}_1 = (\bar{{\textbf{v}}}_1,\bar{{\textbf{x}}}_1),\bar{{\textbf{u}}}_2 = (\bar{{\textbf{v}}}_2,\bar{{\textbf{x}}}_2)\in {\mathbb {R}}^q\times {\mathbb {R}}^p$, we have that

(ii) (Theorem 3.3 of Chen et al. (2013))

$$\begin{aligned}&\Vert \left( T_1(\bar{{\textbf{u}}}_1), T_2(\bar{{\textbf{u}}}_1)\right) - \left( T_1(\bar{{\textbf{u}}}_2), T_2(\bar{{\textbf{u}}}_2)\right) \Vert _\lambda ^2\nonumber \\&\leqslant \Vert \bar{{\textbf{u}}}_1-\bar{{\textbf{u}}}_2\Vert _\lambda ^2- \gamma \left( \frac{2}{L}-\gamma \right) \Vert \nabla f(\bar{{\textbf{x}}}_1)-\nabla f(\bar{{\textbf{x}}}_2)\Vert ^2\nonumber \\&- \Vert \lambda {\textbf{B}}^\top (\bar{{\textbf{v}}}_1-\bar{{\textbf{v}}}_2)\Vert ^2 - \lambda \Vert \left( T_1(\bar{{\textbf{u}}}_1)-T_1(\bar{{\textbf{u}}}_2)\right) -(\bar{{\textbf{v}}}_1-\bar{{\textbf{v}}}_2)\Vert _{\textbf{M}}^2. \end{aligned}$$

(A17)

(iii) (Chen et al., 2013) If f is $\mu$-strongly convex,

$$\begin{aligned}&\Vert \left( T_1(\bar{{\textbf{u}}}_1), T_2(\bar{{\textbf{u}}}_1)\right) - \left( T_1(\bar{{\textbf{u}}}_2), T_2(\bar{{\textbf{u}}}_2)\right) \Vert _\lambda ^2\nonumber \\&\leqslant \Vert \bar{{\textbf{u}}}_1-\bar{{\textbf{u}}}_2\Vert _\lambda ^2- \gamma \mu (2-\gamma L) \Vert \bar{{\textbf{x}}}_1-\bar{{\textbf{x}}}_2\Vert ^2 \nonumber \\&- \Vert \lambda {\textbf{B}}^\top (\bar{{\textbf{v}}}_1-\bar{{\textbf{v}}}_2)\Vert ^2 - \lambda \Vert \left( T_1(\bar{{\textbf{u}}}_1)-T_1(\bar{{\textbf{u}}}_2)\right) -(\bar{{\textbf{v}}}_1-\bar{{\textbf{v}}}_2)\Vert _{\textbf{M}}^2. \end{aligned}$$

(A18)

Proof of Theorem 11

(i) For simplicity we denote $\varvec{\epsilon }_i(u_1)$ by $\varvec{\epsilon }_i, i=1,2$. Then, by the iteration (A3), we have that

$$\begin{aligned} \Vert T_1(\bar{{\textbf{u}}}_1)+\varvec{\epsilon }_1-T_1(\bar{{\textbf{u}}}_2)\Vert _2^2&\leqslant \Vert T_1(\bar{{\textbf{u}}}_1)-T_1(\bar{{\textbf{u}}}_2)\Vert _2^2 + 2\Vert \varvec{\epsilon }_1\Vert _2 \Vert T_1(\bar{{\textbf{u}}}_1)+\varvec{\epsilon }_1-T_1(\bar{{\textbf{u}}}_2)\Vert _2 -\Vert \varvec{\epsilon }_1\Vert _2^2. \end{aligned}$$

$$\begin{aligned}&\Vert T_2(\bar{{\textbf{u}}}_1)+\gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1 - T_2(\bar{{\textbf{u}}}_2)\Vert _2^2 \leqslant \Vert T_2(\bar{{\textbf{u}}}_1) - T_2(\bar{{\textbf{u}}}_2)\Vert _2^2 \\&\quad + 2\Vert \gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1\Vert _2\Vert T_2(\bar{{\textbf{u}}}_1)+\gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1 - T_2(\bar{{\textbf{u}}}_2)\Vert _2-\Vert \gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1\Vert _2^2. \end{aligned}$$
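Both inequalities above are instances of one elementary estimate. As a sketch, for generic vectors ${\textbf{a}}$ and $\varvec{\epsilon }$ of the same dimension (symbols used here only for illustration), expanding $\Vert {\textbf{a}}\Vert _2^2=\Vert ({\textbf{a}}+\varvec{\epsilon })-\varvec{\epsilon }\Vert _2^2$ and applying the Cauchy-Schwarz inequality gives

$$\begin{aligned} \Vert {\textbf{a}}+\varvec{\epsilon }\Vert _2^2&= \Vert {\textbf{a}}\Vert _2^2 + 2\langle \varvec{\epsilon }, {\textbf{a}}+\varvec{\epsilon }\rangle - \Vert \varvec{\epsilon }\Vert _2^2 \\&\leqslant \Vert {\textbf{a}}\Vert _2^2 + 2\Vert \varvec{\epsilon }\Vert _2\Vert {\textbf{a}}+\varvec{\epsilon }\Vert _2 - \Vert \varvec{\epsilon }\Vert _2^2 . \end{aligned}$$

The first inequality corresponds to ${\textbf{a}}=T_1(\bar{{\textbf{u}}}_1)-T_1(\bar{{\textbf{u}}}_2)$ with $\varvec{\epsilon }=\varvec{\epsilon }_1$, and the second to ${\textbf{a}}=T_2(\bar{{\textbf{u}}}_1)-T_2(\bar{{\textbf{u}}}_2)$ with $\varvec{\epsilon }=\gamma \varvec{\epsilon }_2+\lambda {\textbf{B}}^\top \varvec{\epsilon }_1$.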

So we have

$$\begin{aligned}&\Vert \left( T_1(\bar{{\textbf{u}}}_1)+\varvec{\epsilon }_1, T_2(\bar{{\textbf{u}}}_1)+\gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1\right) - \left( T_1(\bar{{\textbf{u}}}_2), T_2(\bar{{\textbf{u}}}_2)\right) \Vert _\lambda ^2 \\&= \Vert T_2(\bar{{\textbf{u}}}_1)+\gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1 - T_2(\bar{{\textbf{u}}}_2)\Vert _2^2 + \lambda \Vert T_1(\bar{{\textbf{u}}}_1)+\varvec{\epsilon }_1-T_1(\bar{{\textbf{u}}}_2)\Vert _2^2\\&\leqslant \Vert \left( T_1(\bar{{\textbf{u}}}_1), T_2(\bar{{\textbf{u}}}_1)\right) - \left( T_1(\bar{{\textbf{u}}}_2), T_2(\bar{{\textbf{u}}}_2)\right) \Vert _\lambda ^2 \\&+ 2\Vert \gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1\Vert _2\Vert T_2(\bar{{\textbf{u}}}_1)+\gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1 - T_2(\bar{{\textbf{u}}}_2)\Vert _2-\Vert \gamma \varvec{\epsilon }_2 + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1\Vert _2^2\\&+ 2\lambda \Vert \varvec{\epsilon }_1\Vert _2 \Vert T_1(\bar{{\textbf{u}}}_1)+\varvec{\epsilon }_1-T_1(\bar{{\textbf{u}}}_2)\Vert _2 - \lambda \Vert \varvec{\epsilon }_1\Vert _2^2. \end{aligned}$$

$\square$

Appendix B: Proof of Theorem 4

Before completing the proof of Theorem 4, we bound the differences between ${\textbf{V}}_t,{\textbf{X}}_t,{\textbf{G}}_t$ and their averages $\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t,\bar{{\textbf{g}}}_t$.

B.1. Proof of Lemma 10

Proof

From Algorithm 1 and the definition of the $\chi$-average consensus algorithm, for $\Vert {\textbf{V}}_{t}- {\textbf{1}}\bar{{\textbf{v}}}_{t}^\top \Vert _2$, we have that

$$\begin{aligned}&\chi ^{-K}\left\| {\textbf{V}}_{t+1}- {\textbf{1}}\bar{{\textbf{v}}}_{t+1}^\top \right\| _2\leqslant \left\| {\textbf{T}}_1({\textbf{V}}_t,{\textbf{X}}_t)-{\textbf{1}}\bar{{\textbf{T}}}_1({\textbf{V}}_t,{\textbf{X}}_t)^\top \right\| _2 \nonumber \\&\leqslant 2 \left\| {\textbf{T}}_1({\textbf{V}}_t,{\textbf{X}}_t)-{\textbf{1}}{\hat{T}}_1(\bar{{\textbf{v}}}_t,\bar{{\textbf{x}}}_t,\bar{{\textbf{g}}}_t)^\top \right\| _2 \nonumber \\&\leqslant 2\left\| \left( {\textbf{X}}_t-{\textbf{1}}\bar{{\textbf{x}}}_t^\top -\gamma {\textbf{G}}_t+\gamma {\textbf{1}}\bar{{\textbf{g}}}_t^\top \right) {\textbf{B}}^\top + \left( {\textbf{V}}_t-{\textbf{1}}\bar{{\textbf{v}}}_t^\top \right) \left( {\textbf{I}}_q-\lambda {\textbf{B}}{\textbf{B}}^\top \right) \right\| _2\nonumber \\&\leqslant 2\left( \lambda _{\max }\left\| {\textbf{X}}_t-{\textbf{1}}\bar{{\textbf{x}}}_t^\top \right\| _2+ \gamma \lambda _{\max } \left\| {\textbf{G}}_t-{\textbf{1}}\bar{{\textbf{g}}}_t^\top \right\| _2+\left\| {\textbf{V}}_t-{\textbf{1}}\bar{{\textbf{v}}}_t^\top \right\| _2\right) . \end{aligned}$$

(B19)

The third inequality is due to Lemma 13 and the last inequality is due to Corollary 7.

Similarly, for $\Vert {\textbf{X}}_t-{\textbf{1}}\bar{{\textbf{x}}}_t^\top \Vert _2$ under the conditions that $0<\gamma<2/L, 0<\lambda \leqslant 1/\lambda _{\max }^2$, we have

$$\begin{aligned}&\chi ^{-K}\left\| {\textbf{X}}_{t+1}-{\textbf{1}}\bar{{\textbf{x}}}_{t+1}^\top \right\| _2 \leqslant \left\| {\textbf{X}}_t-{\textbf{1}}\bar{{\textbf{x}}}_t^\top - \gamma \left( {\textbf{G}}_t- {\textbf{1}}\bar{{\textbf{g}}}_t^\top \right) -\lambda \left( {\textbf{V}}_{t+1}-{\textbf{1}}\bar{{\textbf{v}}}_{t+1}^\top \right) {\textbf{B}}\right\| _2 \nonumber \\&\leqslant (1+2 \lambda \lambda _{\max }^2) \left\| {\textbf{X}}_t- {\textbf{1}}\bar{{\textbf{x}}}_t^\top \right\| _2 + \gamma (1+2 \lambda \lambda _{\max }^2) \left\| {\textbf{G}}_t-{\textbf{1}}\bar{{\textbf{g}}}_t^\top \right\| _2 + 2\lambda \lambda _{\max } \left\| {\textbf{V}}_t-{\textbf{1}}\bar{{\textbf{v}}}_t^\top \right\| _2 \nonumber \\&\leqslant 3 \left\| {\textbf{X}}_t- {\textbf{1}}\bar{{\textbf{x}}}_t^\top \right\| _2 + 3 \gamma \left\| {\textbf{G}}_t-{\textbf{1}}\bar{{\textbf{g}}}_t^\top \right\| _2+ 2\lambda \lambda _{\max } \left\| {\textbf{V}}_t-{\textbf{1}}\bar{{\textbf{v}}}_t^\top \right\| _2. \end{aligned}$$

(B20)

The second inequality is due to inequality (B19). For $\Vert {\textbf{G}}_t-{\textbf{1}}\bar{{\textbf{g}}}_{t}^\top \Vert _2$, we have

$$\begin{aligned}&\chi ^{-K}\left\| {\textbf{G}}_{t+1}-{\textbf{1}}\bar{{\textbf{g}}}_{t+1}^\top \right\| _2\leqslant \left\| {\textbf{G}}_t - {\textbf{1}}\bar{{\textbf{g}}}_t^\top + \nabla F({\textbf{X}}_{t+1})-{\textbf{1}}\nabla {\bar{F}}({\textbf{X}}_{t+1})- \nabla F({\textbf{X}}_{t})+{\textbf{1}}\nabla {\bar{F}}({\textbf{X}}_{t})\right\| _2\nonumber \\&\leqslant \left\| {\textbf{G}}_t-{\textbf{1}}\bar{{\textbf{g}}}_t^\top \right\| _2 + \left\| \nabla F({\textbf{X}}_{t+1}) -\nabla F({\textbf{X}}_{t}) \right\| _2 \nonumber \\&\leqslant \left\| {\textbf{G}}_t-{\textbf{1}}\bar{{\textbf{g}}}_t^\top \right\| _2 + L \left\| {\textbf{X}}_{t+1}-{\textbf{1}}\bar{{\textbf{x}}}_{t+1}^\top \right\| _2 + L \left\| {\textbf{X}}_t-{\textbf{1}}\bar{{\textbf{x}}}_t^\top \right\| _2+ L \sqrt{n}\left\| \bar{{\textbf{x}}}_{t+1}-\bar{{\textbf{x}}}_{t}\right\| _2 \nonumber \\&\leqslant 4 L \left\| {\textbf{X}}_t-{\textbf{1}}\bar{{\textbf{x}}}_t^\top \right\| _2 + 4L\lambda \lambda _{\max }\left\| {\textbf{V}}_t-{\textbf{1}}\bar{{\textbf{v}}}_t^\top \right\| _2+7 \left\| {\textbf{G}}_t- {\textbf{1}}\bar{{\textbf{g}}}_t^\top \right\| _2 + L \sqrt{n}\left\| \bar{{\textbf{x}}}_{t+1}-\bar{{\textbf{x}}}_{t}\right\| _2. \end{aligned}$$

(B21)

The second inequality is owing to $\Vert {\textbf{X}}-{\textbf{1}}\bar{{\textbf{x}}}^\top \Vert _2=\Vert \left( {\textbf{I}}_n-\frac{1}{n}{\textbf{1}}{\textbf{1}}^\top \right) {\textbf{X}}\Vert _2\leqslant \Vert {\textbf{X}}\Vert _2$.

According to Proposition 8, the fixed point $({\textbf{v}}^*,{\textbf{x}}^*)$ satisfies

$$\begin{aligned} {\textbf{x}}^* = T_2({\textbf{v}}^*,{\textbf{x}}^*) = {\textbf{x}}^* - \gamma \nabla f({\textbf{x}}^*)- \lambda {\textbf{B}}^\top {\textbf{v}}^* , \end{aligned}$$

which implies $- \gamma \nabla f({\textbf{x}}^*) - \lambda {\textbf{B}}^\top {\textbf{v}}^* = 0$. Subsequently, for $\Vert \bar{{\textbf{x}}}_{t+1}-\bar{{\textbf{x}}}_t\Vert _2$, we have

$$\begin{aligned}&\left\| \bar{{\textbf{x}}}_{t+1}-\bar{{\textbf{x}}}_t\right\| _2 = \left\| \gamma \bar{{\textbf{g}}}_t-\gamma \nabla f(\bar{{\textbf{x}}}_t)+\gamma \nabla f(\bar{{\textbf{x}}}_t)-\gamma \nabla f({\textbf{x}}^*) + \lambda {\textbf{B}}^\top (\bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*)\right\| _2\\&\leqslant \frac{\gamma L}{\sqrt{n}} \left\| {\textbf{X}}_t- {\textbf{1}}\bar{{\textbf{x}}}_t^\top \right\| _2 + \gamma L \left\| \bar{{\textbf{x}}}_t-{\textbf{x}}^*\right\| _2+ \lambda \lambda _{\max }\left\| \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\right\| _2. \end{aligned}$$

Combining inequalities (B19), (B20), and (B21) and the definition of ${\textbf{z}}_t$, we have

$$\begin{aligned} {\textbf{z}}_{t+1}&\leqslant \chi ^K {\varvec{A}}{\textbf{z}}_t + \chi ^K L\sqrt{n}\left[ 0,0, \gamma L \left\| \bar{{\textbf{x}}}_t-{\textbf{x}}^*\right\| _2+ \lambda \lambda _{\max }\left\| \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\right\| _2\right] ^\top . \end{aligned}$$

Under the conditions $0<\gamma<2/L, 0<\lambda \leqslant 1/\lambda _{\max }^2$, the matrix ${\varvec{A}}$ is given by

$$\begin{aligned} {\varvec{A}}= \left( \begin{array}{ccc} 2&{} 2 \lambda _{\max } &{} 2 \gamma \lambda _{\max } \\ 2\lambda \lambda _{\max } &{} 3 &{} 3 \gamma \\ 4L \lambda \lambda _{\max } &{} 6L &{} 7 \end{array} \right) . \end{aligned}$$

Then $\Vert {\varvec{A}}\Vert _\infty \leqslant 7 + 6\,L+ 2\lambda _{\max } + 3\gamma (1+\lambda _{\max }) + 4\lambda \lambda _{\max } (1+L)$.

Therefore, with

$$\begin{aligned} K_t = \frac{\log (1/\alpha _t)+\log \left( 7 + 6L+ 2\lambda _{\max } + 3\gamma (1+\lambda _{\max }) + 4\lambda \lambda _{\max } (1+L)\right) }{\log \left( 1/\chi \right) }, \end{aligned}$$

for any $\alpha _t>0$, we have that

$$\begin{aligned} \left\| {\textbf{z}}_{t+1}\right\| _\infty&\leqslant \alpha _t\left\| {\textbf{z}}_t\right\| _\infty + \alpha _t \sqrt{n}\left( \gamma L \left\| \bar{{\textbf{x}}}_t-{\textbf{x}}^*\right\| _2+ \lambda \lambda _{\max }\left\| \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\right\| _2\right) . \end{aligned}$$
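The role of this choice of $K_t$ can be checked directly. As a sketch, write $C_A = 7 + 6L+ 2\lambda _{\max } + 3\gamma (1+\lambda _{\max }) + 4\lambda \lambda _{\max } (1+L)$ for the stated bound on $\Vert {\varvec{A}}\Vert _\infty$ (the symbol $C_A$ is introduced here only for illustration). Then

$$\begin{aligned} \chi ^{K_t} = \exp \left( -K_t\log (1/\chi )\right) = \exp \left( -\log (1/\alpha _t)-\log C_A\right) = \frac{\alpha _t}{C_A}, \qquad \chi ^{K_t}\Vert {\varvec{A}}\Vert _\infty \leqslant \alpha _t . \end{aligned}$$

Since $C_A\geqslant L$, the factor $\chi ^{K_t}L$ multiplying the remaining term is also at most $\alpha _t$. If $K_t$ is rounded up to an integer, $\chi ^{K_t}$ only decreases, so the same bounds continue to hold.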

$\square$

B.2. Proof of Theorem 4

Proof

Suppose $f=1/n\sum _i f_i$ is $\mu$-strongly convex. Let ${\textbf{u}}^*=({\textbf{v}}^*,{\textbf{x}}^*)$ be a fixed point of $(T_1,T_2)$. From (A4) and (A18) in Theorem 11, it follows that

$$\begin{aligned} \left\| \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\right\| _\lambda ^2&\leqslant \left\| \bar{{\textbf{u}}}_t-{\textbf{u}}^*\right\| _\lambda ^2 - \mu \gamma (2-\gamma L)\left\| \bar{{\textbf{x}}}_t-{\textbf{x}}^*\right\| _2^2 -\lambda ^2\lambda _{\min }^2 \left\| \bar{{\textbf{v}}}_t-{\textbf{v}}^*\right\| _2^2 - \lambda \left\| T_1(\bar{{\textbf{u}}}_t)-\bar{{\textbf{v}}}_t\right\| _{\textbf{M}}^2 \nonumber \\&+ 2\left\| \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t \right\| _2\left\| \bar{{\textbf{x}}}_{t+1} - {\textbf{x}}^*\right\| _2-\left\| \gamma \varvec{\epsilon }_2^t + \lambda {\textbf{B}}^\top \varvec{\epsilon }_1^t \right\| _2^2 + 2\lambda \left\| \varvec{\epsilon }_1^t\right\| _2 \left\| \bar{{\textbf{v}}}_{t+1}-{\textbf{v}}^*\right\| _2 - \lambda \left\| \varvec{\epsilon }_1^t\right\| _2^2\nonumber \\&\leqslant \delta ^2 \left\| \bar{{\textbf{u}}}_t-{\textbf{u}}^*\right\| _\lambda ^2 + 2\eta _t \left\| \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\right\| _\lambda - \eta _t^2, \end{aligned}$$

(B22)

where $\delta =\sqrt{1-\min \{\mu \gamma (2-\gamma L), \lambda \lambda _{\min }^2\}}<1$ and $\eta _t$ is defined by (A6). Then

$$\begin{aligned} \left\| \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\right\| _\lambda&\leqslant \delta \left\| \bar{{\textbf{u}}}_t-{\textbf{u}}^*\right\| _\lambda + \eta _t. \end{aligned}$$

(B23)
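The contraction factor in (B22) can be traced as follows. This is a sketch that assumes $\Vert ({\textbf{v}},{\textbf{x}})\Vert _\lambda ^2=\Vert {\textbf{x}}\Vert _2^2+\lambda \Vert {\textbf{v}}\Vert _2^2$ and writes $c=\min \{\mu \gamma (2-\gamma L), \lambda \lambda _{\min }^2\}$ as shorthand used only here:

$$\begin{aligned}&\Vert \bar{{\textbf{u}}}_t-{\textbf{u}}^*\Vert _\lambda ^2 - \mu \gamma (2-\gamma L)\Vert \bar{{\textbf{x}}}_t-{\textbf{x}}^*\Vert _2^2 -\lambda ^2\lambda _{\min }^2 \Vert \bar{{\textbf{v}}}_t-{\textbf{v}}^*\Vert _2^2 \\&= \left( 1-\mu \gamma (2-\gamma L)\right) \Vert \bar{{\textbf{x}}}_t-{\textbf{x}}^*\Vert _2^2 + \left( 1-\lambda \lambda _{\min }^2\right) \lambda \Vert \bar{{\textbf{v}}}_t-{\textbf{v}}^*\Vert _2^2 \\&\leqslant (1-c)\left( \Vert \bar{{\textbf{x}}}_t-{\textbf{x}}^*\Vert _2^2+\lambda \Vert \bar{{\textbf{v}}}_t-{\textbf{v}}^*\Vert _2^2\right) = \delta ^2\Vert \bar{{\textbf{u}}}_t-{\textbf{u}}^*\Vert _\lambda ^2 . \end{aligned}$$

Completing the square in the last line of (B22) then gives $\left( \Vert \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\Vert _\lambda -\eta _t\right) ^2\leqslant \delta ^2\Vert \bar{{\textbf{u}}}_t-{\textbf{u}}^*\Vert _\lambda ^2$, which is (B23) after taking square roots.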

Squaring inequality (B23), using $\delta <1$, and applying (A7), we have that

$$\begin{aligned} \left\| \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\right\| _\lambda ^2&\leqslant \delta ^2 \left\| \bar{{\textbf{u}}}_t-{\textbf{u}}^*\right\| _\lambda ^2 + 2\eta _t \left\| \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\right\| _\lambda + \eta _t^2\\&\leqslant \delta ^2 \left\| \bar{{\textbf{u}}}_t-{\textbf{u}}^*\right\| _\lambda ^2 + 2 C_1/\sqrt{n} \left\| {\textbf{z}}_t\right\| _\infty \left\| \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\right\| _\lambda + C_1^2/n \left\| {\textbf{z}}_t\right\| _\infty ^2. \end{aligned}$$

Then, we prove by induction that

$$\begin{aligned} \left\| \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\right\| _\lambda ^2\leqslant \psi ^t \left( \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda ^2 + C_2^2/n \left\| {\textbf{z}}_0\right\| _\infty ^2\right) , \end{aligned}$$

(B24)

for any $t\geqslant 0$, where $\psi \ge (1+\delta ^2)/2$ and $C_2^2 = \frac{8}{1 -\delta ^2} C_1^2$.

When $t=0$, it holds that

$$\begin{aligned} \left\| \bar{{\textbf{u}}}_{1}-{\textbf{u}}^*\right\| _\lambda ^2&\leqslant \delta ^2 \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda ^2 + (\psi -\delta ^2)\left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda ^2 + \frac{C_1^2}{n(\psi -\delta ^2)} \left\| {\textbf{z}}_0\right\| _\infty ^2 + C_1^2 \left\| {\textbf{z}}_0\right\| _\infty ^2/n\\&\leqslant \psi \left( \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda ^2 + C_2^2/n \left\| {\textbf{z}}_0\right\| _\infty ^2\right) . \end{aligned}$$
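The first inequality in the display above is an application of Young's inequality $2ab\leqslant ca^2+b^2/c$ with $c=\psi -\delta ^2>0$. As a sketch of the two steps, using $\psi -\delta ^2\geqslant (1-\delta ^2)/2$, $\psi \geqslant 1/2$ and $C_2^2 = 8C_1^2/(1-\delta ^2)$:

$$\begin{aligned} \frac{2C_1}{\sqrt{n}}\left\| {\textbf{z}}_0\right\| _\infty \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda&\leqslant (\psi -\delta ^2)\left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda ^2 + \frac{C_1^2}{n(\psi -\delta ^2)} \left\| {\textbf{z}}_0\right\| _\infty ^2, \\ \frac{C_1^2}{n}\left( \frac{1}{\psi -\delta ^2}+1\right)&\leqslant \frac{C_1^2}{n}\left( \frac{2}{1-\delta ^2}+1\right) \leqslant \frac{3C_1^2}{n(1-\delta ^2)} \leqslant \frac{4C_1^2}{n(1-\delta ^2)} \leqslant \frac{\psi C_2^2}{n} . \end{aligned}$$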

Therefore, the inequality (B24) holds for $t = 1$; the case $t=0$ is immediate. Next, we assume that, for $s=0,\cdots ,t$, the inequality (B24) holds. We then prove that the inequality (B24) holds for $t+1$.

From inequalities (A10) and (B23), under the conditions that $\gamma <2/L$ and $\lambda \leqslant 1/\lambda _{\max }^2$, we have

$$\begin{aligned} \left\| {\textbf{z}}_{t+1}\right\| _\infty&\leqslant \alpha _t\left\| {\textbf{z}}_t\right\| _\infty + \alpha _t \sqrt{n} \left( \gamma L\left\| \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\right\| _\lambda + \sqrt{\lambda }\lambda _{\max }\left\| \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\right\| _\lambda \right) \\&\leqslant \alpha _t(1+\sqrt{\lambda }\lambda _{\max }C_1)\left\| {\textbf{z}}_t\right\| _\infty + \alpha _t \sqrt{n} \left( \gamma L+ \sqrt{\lambda }\lambda _{\max }\right) \left\| \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\right\| _\lambda \\&\leqslant \alpha _t(1+C_1)\left\| {\textbf{z}}_t\right\| _\infty +3 \alpha _t \sqrt{n} \left\| \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\right\| _\lambda . \end{aligned}$$

Let $\alpha _t = \alpha$ satisfy $\alpha (1+C_1)\leqslant \frac{1}{2} \psi ^{1/2}\leqslant \frac{1}{2}$. Then, for $t\geqslant 1$,

$$\begin{aligned} \left\| {\textbf{z}}_{t}\right\| _\infty&\leqslant \alpha ^{t} (1+C_1)^{t} \left\| {\textbf{z}}_0\right\| _\infty + 3\alpha \sqrt{n}\sum _{s=0}^{t-1} \alpha ^{t-1-s} (1+C_1)^{t-1-s} \left\| \bar{{\textbf{u}}}_{s}-{\textbf{u}}^*\right\| _\lambda \\&\leqslant 2\alpha (1+C_1) \psi ^{t/2} \left\| {\textbf{z}}_0\right\| _\infty \\&\quad + 3\alpha \sqrt{n} \frac{\psi ^{t/2}}{\psi ^{1/2} - \alpha (1+C_1)}\left( \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda + \frac{C_2}{\sqrt{n}}\left\| {\textbf{z}}_0\right\| _\infty \right) \\&\leqslant \psi ^{t/2}\alpha \sqrt{n} C_3\left( \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda +\frac{C_2}{\sqrt{n}} \left\| {\textbf{z}}_0\right\| _\infty \right) . \end{aligned}$$

Here $C_3= 2(1+C_1)/C_2 + 12$. Then

$$\begin{aligned} \left\| {\textbf{z}}_t\right\| _\infty \left\| \bar{{\textbf{u}}}_{t}-{\textbf{u}}^*\right\| _\lambda \leqslant \psi ^{t}2\alpha \sqrt{n} C_3 \left( \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda ^2 + C_2^2/n \left\| {\textbf{z}}_0\right\| _\infty ^2\right) . \end{aligned}$$

If $\alpha < \frac{\psi -\delta ^2}{2\left( 2C_1C_3+C_1^2C_3\right) }$ and $\alpha <1$, we obtain

$$\begin{aligned} \left\| \bar{{\textbf{u}}}_{t+1}-{\textbf{u}}^*\right\| _\lambda ^2&\leqslant \delta ^2 \psi ^t \left( \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda ^2 + C_2^2/n \left\| {\textbf{z}}_0\right\| _\infty ^2\right) \\&\quad + 4\alpha C_1C_3\psi ^{t} \left( \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda ^2 + C_2^2/n \left\| {\textbf{z}}_0\right\| _\infty ^2\right) \\&\quad + \alpha ^2 C_1^2 C_3\psi ^{t}\left( \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda +\frac{C_2}{\sqrt{n}} \left\| {\textbf{z}}_0\right\| _\infty \right) ^2\\&\leqslant \psi ^{t+1} \left( \left\| \bar{{\textbf{u}}}_{0}-{\textbf{u}}^*\right\| _\lambda ^2 + C_2^2/n \left\| {\textbf{z}}_0\right\| _\infty ^2\right) . \end{aligned}$$

Thus, we prove the inequality (B24) using induction. For $\alpha$, it holds that

$$\begin{aligned} \alpha < \min \{\frac{\psi -\delta ^2}{24 C_1\left( 2+C_1\right) }, \frac{\psi ^{1/2}}{2(1+C_1)}\}. \end{aligned}$$

With $\psi \ge 1/2$, let

$$\begin{aligned} \alpha< \frac{\psi -\delta ^2}{24 \left( 1+C_1\right) ^2} < \min \{\frac{\psi -\delta ^2}{24 C_1\left( 2+C_1\right) }, \frac{\psi ^{1/2}}{2(1+C_1)}\}, \end{aligned}$$

then we have

$$\begin{aligned} \psi = \max \left\{ \delta ^2 + 24(1+C_1)^2\alpha , \frac{1+\delta ^2}{2}\right\} . \end{aligned}$$

Choose $\alpha \le \frac{1-\delta ^2}{48(1+C_1)^2}$ so that $\psi =\frac{1+\delta ^2}{2}$. To achieve $\Vert {\textbf{X}}_T-{\textbf{1}}{\textbf{x}}^{*^\top }\Vert _2^2={\mathcal {O}}(\epsilon )$, let $\gamma =1/L$ and $\lambda =\min \{1/\lambda _{\max }^2, \mu /(L\lambda _{\min }^2)\}$. The number of iterations $T$, that is, the computational complexity, is then given by

$$\begin{aligned} T= {\mathcal {O}}(\max \{L/\mu , \lambda _{\max }^2/\lambda _{\min }^2\}\log (\epsilon ^{-1})). \end{aligned}$$

Since

$$\begin{aligned} \left( \frac{1+\delta ^2}{2}\right) ^{T}&= \left( 1 - \frac{1}{2}\min \{\mu /L, \lambda _{\min }^2/\lambda _{\max }^2\}\right) ^T\\&\leqslant \exp \left( -\frac{T}{2} \min \{\mu /L, \lambda _{\min }^2/\lambda _{\max }^2\}\right) ={\mathcal {O}}(\epsilon ), \end{aligned}$$

we have

$$\begin{aligned} \left\| {\textbf{X}}_T-{\textbf{1}}{\textbf{x}}^{*^\top }\right\| _2^2&\leqslant 2\left( \left\| {\textbf{X}}_T-{\textbf{1}}\bar{{\textbf{x}}}_T^{\top }\right\| _2^2+ \left\| {\textbf{1}}\bar{{\textbf{x}}}_T^{\top }-{\textbf{1}}{\textbf{x}}^{*^\top }\right\| _2^2\right) \\&\leqslant 2\left( \left\| {\textbf{z}}_T\right\| _\infty ^2 + n\left\| \bar{{\textbf{u}}}_T-{\textbf{u}}^*\right\| _\lambda ^2\right) = {\mathcal {O}}(\epsilon ). \end{aligned}$$
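The first equality in the bound on $\left( (1+\delta ^2)/2\right) ^T$ above follows by substituting the stated step sizes into the definition of $\delta$. As a sketch, with $\gamma =1/L$ and $\lambda =\min \{1/\lambda _{\max }^2, \mu /(L\lambda _{\min }^2)\}$:

$$\begin{aligned} \mu \gamma (2-\gamma L)&= \frac{\mu }{L}, \qquad \lambda \lambda _{\min }^2 = \min \left\{ \frac{\lambda _{\min }^2}{\lambda _{\max }^2}, \frac{\mu }{L}\right\} , \\ 1-\delta ^2&= \min \{\mu \gamma (2-\gamma L), \lambda \lambda _{\min }^2\} = \min \left\{ \frac{\mu }{L}, \frac{\lambda _{\min }^2}{\lambda _{\max }^2}\right\} , \end{aligned}$$

so that $(1+\delta ^2)/2 = 1-\frac{1}{2}\min \{\mu /L, \lambda _{\min }^2/\lambda _{\max }^2\}$.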

With $\alpha =(1-\delta ^2)/(48\left( 1+C_1\right) ^2)$ and $K$ in Lemma 10 satisfying

$$\begin{aligned} K = {\mathcal {O}}\left( \frac{1}{1-\chi } \log \frac{g_1(L,\mu ,\lambda _{\max },\lambda _{\min })}{g_2(L,\mu ,\lambda _{\max },\lambda _{\min })}\right) , \end{aligned}$$

where $g_1,g_2$ are polynomials in $L,\mu ,\lambda _{\max },\lambda _{\min }$ satisfying

$$\begin{aligned}&\frac{g_{1}(L,\mu ,\lambda _{\max },\lambda _{\min })}{g_{2}(L,\mu ,\lambda _{\max },\lambda _{\min })} \\&= \max \left\{ \frac{L}{\mu }, \frac{\lambda _{\max }^2}{\lambda _{\min }^2}\right\} \left( 7+\frac{2}{\lambda _{\max }}+L+\frac{4}{L}\right) ^2\left( 7+6L+2\lambda _{\max }+\frac{3(1+\lambda _{\max })}{L}+\frac{4(1+L)}{\lambda _{\max }}\right) , \end{aligned}$$

the communication complexity is

$$\begin{aligned} Q = {\mathcal {O}}\left( \frac{\max \{L/\mu , \lambda _{\max }^2/\lambda _{\min }^2\}}{1-\chi } \log \frac{g_1(L,\mu ,\lambda _{\max },\lambda _{\min })}{g_2(L,\mu ,\lambda _{\max },\lambda _{\min })}\log \left( \frac{1}{\epsilon }\right) \right) . \end{aligned}$$

$\square$
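As a rough illustration of how the quantities in the final bound combine, the sketch below evaluates the ratio $g_1/g_2$ and the orders of $K$, $T$ and $Q$ with all hidden constants set to one. It reads $Q$ as the product of the orders of $T$ and $K$, which is what the displayed formulas suggest. The function names, the requirement $0\le \chi <1$, and the sample values are assumptions made for this example only.

```python
import math


def ratio_g1_g2(L, mu, lam_max, lam_min):
    """Evaluate the displayed expression for g1/g2."""
    kappa = max(L / mu, lam_max ** 2 / lam_min ** 2)
    first = (7 + 2 / lam_max + L + 4 / L) ** 2
    second = (7 + 6 * L + 2 * lam_max
              + 3 * (1 + lam_max) / L + 4 * (1 + L) / lam_max)
    return kappa * first * second


def complexity_orders(L, mu, lam_max, lam_min, chi, eps):
    """Orders of T, K and Q = T * K with hidden constants set to 1.

    Assumes 0 <= chi < 1 (hypothetical range for this sketch).
    """
    kappa = max(L / mu, lam_max ** 2 / lam_min ** 2)
    T = kappa * math.log(1.0 / eps)
    K = math.log(ratio_g1_g2(L, mu, lam_max, lam_min)) / (1.0 - chi)
    return T, K, T * K


# Hypothetical values chosen only for illustration.
T_order, K_order, Q_order = complexity_orders(
    L=10.0, mu=1.0, lam_max=2.0, lam_min=1.0, chi=0.5, eps=1e-6
)
print(T_order, K_order, Q_order)
```

The numbers are only indicative of scaling: halving $1-\chi$ doubles the order of $K$ and therefore of $Q$, while $\epsilon$ enters only through the $\log (1/\epsilon )$ factor in $T$.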

Appendix C: Several important Lemmas

Lemma 13

For ${\textbf{Y}}=({\textbf{y}}_1,\cdots ,{\textbf{y}}_n)^\top \in {\mathbb {R}}^{n\times p}$, and ${\textbf{x}}\in {\mathbb {R}}^p$, we denote $\bar{{\textbf{y}}}= \frac{1}{n} \sum _{i=1}^n {\textbf{y}}_i$. It holds that

$$\begin{aligned} \left\| {\textbf{1}}\bar{{\textbf{y}}}^\top - {\textbf{1}}{\textbf{x}}^\top \right\| _2\leqslant \left\| {\textbf{Y}}-{\textbf{1}}{\textbf{x}}^\top \right\| _2. \end{aligned}$$

Proof

We have that

$$\begin{aligned} \left\| {\textbf{1}}\bar{{\textbf{y}}}^\top - {\textbf{1}}{\textbf{x}}^\top \right\| _2 = \sqrt{n} \left\| \bar{{\textbf{y}}}-{\textbf{x}}\right\| _2 \leqslant \frac{1}{\sqrt{n}}\sum _{i=1}^n \left\| {\textbf{y}}_i-{\textbf{x}}\right\| _2 \leqslant \sqrt{\sum _{i=1}^n \left\| {\textbf{y}}_i-{\textbf{x}}\right\| _2^2} = \left\| {\textbf{Y}}-{\textbf{1}}{\textbf{x}}^\top \right\| _2. \end{aligned}$$

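The first inequality follows from the triangle inequality applied to $\bar{{\textbf{y}}}-{\textbf{x}}=\frac{1}{n}\sum _{i=1}^n ({\textbf{y}}_i-{\textbf{x}})$, and the last inequality uses the Cauchy-Schwarz inequality, $\sum _{i=1}^n a_i \leqslant \sqrt{n}\sqrt{\sum _{i=1}^n a_i^2}$ with $a_i=\left\| {\textbf{y}}_i-{\textbf{x}}\right\| _2$. $\square$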
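The chain of inequalities in Lemma 13 can be checked numerically. The sketch below computes the three quantities appearing in the proof for a random ${\textbf{Y}}$ and ${\textbf{x}}$; following the final equality in the proof, the matrix norm is evaluated as the square root of the sum of squared row norms. The helper name and the random test data are hypothetical.

```python
import math
import random


def lemma13_terms(Y, x):
    """Return (left, middle, right) from the proof of Lemma 13.

    left   = sqrt(n) * ||y_bar - x||_2
    middle = (1/sqrt(n)) * sum_i ||y_i - x||_2
    right  = sqrt(sum_i ||y_i - x||_2^2)
    """
    n, p = len(Y), len(x)
    y_bar = [sum(row[j] for row in Y) / n for j in range(p)]
    dists = [math.sqrt(sum((row[j] - x[j]) ** 2 for j in range(p)))
             for row in Y]
    left = math.sqrt(n) * math.sqrt(
        sum((y_bar[j] - x[j]) ** 2 for j in range(p)))
    middle = sum(dists) / math.sqrt(n)
    right = math.sqrt(sum(d * d for d in dists))
    return left, middle, right


random.seed(0)
n, p = 5, 3
Y = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
x = [random.gauss(0.0, 1.0) for _ in range(p)]

left, middle, right = lemma13_terms(Y, x)
# Small tolerance guards against floating-point rounding.
assert left <= middle + 1e-12 and middle <= right + 1e-12
print(left, middle, right)
```

When all rows of ${\textbf{Y}}$ coincide, the three quantities are equal, so the bound in the lemma cannot be improved in general.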

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tang, K., Liu, W. & Mao, X. Multi-consensus decentralized primal-dual fixed point algorithm for distributed learning. Mach Learn (2024). https://doi.org/10.1007/s10994-024-06537-8

Received: 24 May 2023
Revised: 22 January 2024
Accepted: 05 March 2024
Published: 08 April 2024
DOI: https://doi.org/10.1007/s10994-024-06537-8
