
Localization and Approximations for Distributed Non-convex Optimization

Journal of Optimization Theory and Applications

Abstract

Distributed optimization has many applications in communication networks, sensor networks, signal processing, machine learning, and artificial intelligence. Methods for distributed convex optimization are widely investigated, while those for non-convex objectives are not well understood. One of the first non-convex distributed optimization frameworks over an arbitrary interaction graph was proposed by Di Lorenzo and Scutari (IEEE Trans Signal Inf Process Netw 2:120–136, 2016); it iteratively applies a combination of local optimization with convex approximations and local averaging. Motivated by application problems such as resource allocation in multi-cellular networks, we generalize the existing results in two ways. When the decision variables are separable such that there is partial dependency in the objectives, we reduce the communication and memory complexity of the algorithm so that nodes only keep and communicate local variables instead of the whole vector of variables. In addition, we relax the assumption that the objectives' gradients are bounded and Lipschitz by means of successive proximal approximations. The proposed algorithmic framework is shown to be more widely applicable and numerically stable.


Notes

  1. All results can be trivially extended to directed time-varying graphs: see [19] for NEXT and the Appendix for our algorithm. For ease of presentation we adopt the above settings.

  2. We add nodes to get connectedness if it does not hold. More on this in Sect. 4.2.

  3. Note that Assumption N1 requires \(\lim _{n\rightarrow \infty }L^{\min }_n=\infty \), where \(L^{\min }_n=\min _iL_{i,n}\) with \(f^*_{i,n}\) given in (7). For any choice of \(f^*_{i,n}\) other than the double Moreau envelope function [16], Assumption N1 requires \(\lim _{n\rightarrow \infty }L_{i,n}=\infty \) for any i such that \(\nabla f_i\) is non-Lipschitz continuous.

  4. Note that this holds for \(\hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}\) as well because the elements in the \(\mathcal {K}_{\mathcal {N}\setminus {\mathcal {S}_i}}\) subspace just cancel each other out.

  5. Note that \(\textbf{x}^c\) refers to \(\textbf{x}^{M+1}\) and \(M+1\) is in \({\mathcal {S}_i}\) as well if the part exists. Therefore, the second term \(L_G\Vert \textbf{x}^c_i-\tilde{\textbf{x}}^c_i\Vert \) can be put into the summation in the first term.

References

  1. Attouch, H., Azé, D.: Approximation and regularization of arbitrary functions in Hilbert spaces by the Lasry–Lions method. Annales de l'IHP Analyse non linéaire 10(3), 289–312 (1993)

  2. Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)

  3. Bianchi, P., Jakubowicz, J.: Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Trans. Autom. Control 58(2), 391–405 (2013)

  4. Borwein, J., Zhu, Q.: Techniques of Variational Analysis. Springer, New York (2005)

  5. Boyd, S., Diaconis, P., Xiao, L.: Fastest mixing Markov chain on a graph. SIAM Rev. 46(4), 667–689 (2004)

  6. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  7. Chen, C., He, B., Ye, Y., Yuan, X.: The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Math. Program. 155(1–2), 57–79 (2016)

  8. d'Aspremont, A., Scieur, D., Taylor, A.: Acceleration methods. arXiv preprint arXiv:2101.09545 (2021)

  9. Duchi, J.C., Agarwal, A., Wainwright, M.J.: Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(3), 592–606 (2012)

  10. Facchinei, F., Scutari, G., Sagratella, S.: Parallel selective algorithms for nonconvex big data optimization. IEEE Trans. Signal Process. 63(7), 1874–1889 (2015)

  11. Hajinezhad, D., Hong, M., Zhao, T., Wang, Z.: NESTT: a nonconvex primal-dual splitting method for distributed and stochastic optimization. In: Advances in Neural Information Processing Systems (2016)

  12. Huang, J., Subramanian, V., Agrawal, R., Berry, R.: Downlink scheduling and resource allocation for OFDM systems. IEEE Trans. Wirel. Commun. 8(1), 288–296 (2009)

  13. Jakovetić, D., Xavier, J., Moura, J.M.F.: Convergence rate analysis of distributed gradient methods for smooth optimization. In: 20th Telecommunications Forum (2012)

  14. Kao, H., Subramanian, V.: Convergence rate analysis for distributed optimization with localization. In: Proceedings of the Allerton Conference on Communication, Control, and Computing (2019)

  15. Kao, H., Subramanian, V.: Localization and approximations for distributed non-convex optimization. arXiv preprint arXiv:1706.02599v2 (2021)

  16. Lasry, J.M., Lions, P.L.: A remark on regularization in Hilbert spaces. Israel J. Math. 55(3), 257–266 (1986)

  17. Li, B., Cen, S., Chen, Y., Chi, Y.: Communication-efficient distributed optimization in networks with gradient tracking and variance reduction. In: International Conference on Artificial Intelligence and Statistics (2020)

  18. Li, X., Scaglione, A.: Convergence and applications of a gossip-based Gauss–Newton algorithm. IEEE Trans. Signal Process. 61(21), 5231–5246 (2013)

  19. Di Lorenzo, P., Scutari, G.: NEXT: in-network nonconvex optimization. IEEE Trans. Signal Inf. Process. Over Netw. 2(2), 120–136 (2016)

  20. Mansoori, F., Wei, E.: Superlinearly convergent asynchronous distributed network Newton method. In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC) (2017)

  21. Nedić, A., Ozdaglar, A.: Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)

  22. Notarstefano, G., Notarnicola, I., Camisa, A.: Distributed optimization for smart cyber-physical networks. arXiv preprint arXiv:1906.10760 (2019)

  23. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)

  24. Rabbat, M., Nowak, R.: Distributed optimization in sensor networks. In: Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks (2004)

  25. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer, Berlin (1998)

  26. Tsitsiklis, J., Bertsekas, D., Athans, M.: Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Autom. Control 31(9), 803–812 (1986)

  27. Xiao, Y., Hu, J.: Distributed solutions of convex feasibility problems with sparsely coupled constraints. In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC) (2017)

  28. Zhang, J., Uribe, C.A., Mokhtari, A., Jadbabaie, A.: Achieving acceleration in distributed optimization via direct discretization of the heavy-ball ODE. In: 2019 American Control Conference (ACC) (2019)

  29. Zhu, M., Martínez, S.: On the convergence time of asynchronous distributed quantized averaging algorithms. IEEE Trans. Autom. Control 56(2), 386–390 (2010)

Author information

Correspondence to Hsu Kao.

Additional information

Communicated by Fabián Flores-Bazán.


Appendices

Appendix A: Generalizations to Time-Varying Graphs

In this appendix, we provide the weaker assumptions that accommodate the case where the underlying graph is time-varying and directed. The proof of our result is carried out in this more general setting, but the purpose of these assumptions remains the same: a distribution on the set of nodes converges to the uniform distribution exponentially fast under repeated application of the \(\textbf{W}\) matrix, which is captured in Lemma B.1.

At each time slot n, the set of nodes \(\mathcal {N}\), along with a set of time-varying directed edges \(\mathcal {E}[n]\), forms a directed graph \(\mathcal {G}[n]=(\mathcal {N},\mathcal {E}[n])\). Node j can send a message to node i in time slot n only if \(j\rightarrow i\in \mathcal {E}[n]\).

Assumption L\(^\prime \)

(L3\(^\prime \)):

\(\mathcal {G}_m[n]\) is \(B_m\)-strongly connected for all \(m\in [M+1]\), where \(\mathcal {G}_m[n]=(\mathcal {N}_m,\mathcal {E}_m[n]=\{i\rightarrow j\in \mathcal {E}[n]:i\in \mathcal {N}_m\text { or }j\in \mathcal {N}_m\})\), i.e. \((\mathcal {N}_m,\bigcup _{n=kB_m}^{(k+1)B_m-1}\mathcal {E}_m[n])\) is strongly connected for all \(k\ge 0\);

(L4\(^\prime \)):

For all \(m\in [M+1]\) there is a matrix \(\textbf{W}^m[n]\) associated with \(\mathcal {N}_m\): each entry is non-zero if and only if there is a corresponding edge in \(\mathcal {E}_m[n]\). As before, \(\textbf{W}^m[n]\) is doubly stochastic after deleting the rows and columns whose indices are not in \(\mathcal {N}_m\), and \(\textbf{W}^m[n]-\frac{1}{I_m}\textbf{1}_{I_m}\textbf{1}_{I_m}^\top \) has spectral radius smaller than 1.
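For intuition, a matrix satisfying the requirements of (L4\(^\prime \)) on a static graph can be built with Metropolis–Hastings weights. The following minimal Python sketch (our illustration, with a hypothetical 4-node connected graph, not part of the paper) checks double stochasticity and the spectral-radius condition:

```python
import numpy as np

# Hypothetical 4-node connected undirected graph given by its edge list.
I = 4
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
deg = np.zeros(I)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

# Metropolis-Hastings weights: w_ij = 1/(1 + max(deg_i, deg_j)) on edges.
W = np.zeros((I, I))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1 - W.sum(axis=1))      # self-weights make every row sum to 1

# Symmetric with unit row sums, hence doubly stochastic.
assert np.allclose(W.sum(axis=0), 1) and np.allclose(W.sum(axis=1), 1)
rho = max(abs(np.linalg.eigvals(W - np.ones((I, I)) / I)))
print(f"spectral radius = {rho:.3f} < 1")   # the geometric mixing factor
```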

Appendix B: Proof of the Main Result

In this appendix, we prove Theorem 3. For an intuitive description of the roadmap of the entire proof, the reader may refer to the start of Appendix B in [15]. Our proof follows the main structure of [19]; in particular, the four parts of Proposition B.2 correspond to the four parts of Proposition 9 in [19]. Some detailed derivations are omitted due to space limits; the complete version can be found in [15]. The convergence of NEXT consists of two parts: on the faster time-scale, the "consensus convergence" of the local variable iterates \(\textbf{x}_i\) and the local gradient iterates \(\textbf{y}_i\) to the average iterate \(\bar{\textbf{x}}\) and to \(\overline{\nabla f}(\bar{\textbf{x}})\triangleq \frac{1}{I}\sum _{i\in \mathcal {N}}\nabla f_i(\bar{\textbf{x}})\) (the average gradient evaluated at the average iterate), respectively; and on the slower time-scale, the fixed-point iterations of \(\hat{\textbf{x}}_i(\bullet )\) (defined in (36)), which we call "path convergence" (for details of this viewpoint and the connection to two time-scale stochastic approximation, see [15]). It is shown in [19] that the fixed points of \(\hat{\textbf{x}}_i(\bullet )\) coincide with the stationary points. The goal is thus to prove that the iterates converge to the fixed points of \(\hat{\textbf{x}}_i(\bullet )\) and that the iterates of all nodes asymptotically agree. Our contributions are introducing partial dependency structures with a localization scheme and, in addition, successively approximating objective functions whose gradients may be non-Lipschitz.

Compared to the proof of NEXT, the generalization to multiple local dependency sets is not technically hard; it requires some simple inequalities, as shown in parts (a) and (b) of Proposition B.2. For the second generalization, we replace what was the Lipschitz constant L and the strong convexity constant \(\tau \) by sequences \(\{L_n\}\) and \(\{\tau _n\}\), which now may grow to infinity and decrease to zero, respectively. This significantly increases the difficulty of the proof. The conditions in Theorem 3, the Technical Assumption T stated below, and Lemma C.1 are chosen so that all the relevant series, now involving \(\{L_n\}\) and \(\{\tau _n\}\), still converge. The unbounded gradient issue and the set-valued (correspondence) nature of \(\hat{\textbf{x}}_{i,\infty }\) also make our scheme much trickier to analyze than NEXT.

B.1 Notations

We define a set of notations to proceed with the proof. All notation is defined for all i, m, and n, whenever applicable.

Original variables

  • \(\textbf{x}^m[n]=(\mathbb {I}\{i\in {\mathcal {N}_m}\}\textbf{x}^m_i[n])_{i\in \mathcal {N}}=[\mathbb {I}\{1\in \mathcal {N}_m\} \textbf{x}^m_1[n]^\top \hspace{5.0pt}\cdots \hspace{5.0pt}\mathbb {I}\{I\in \mathcal {N}_m\}\textbf{x}^m_I[n]^\top ]^\top \): the concatenation of the part m decision variables from all nodes in \({\mathcal {N}_m}\), zero-padded for nodes not in \({\mathcal {N}_m}\); when the context is clear, we also use \(\textbf{x}^m[n]\) for the non-padded version \((\textbf{x}^m_i[n])_{i\in {\mathcal {N}_m}}\), i.e., the vector containing \(\textbf{x}^m_i[n]\) only for i in \({\mathcal {N}_m}\); the notation \((v_i)_{i\in \mathcal {S}}\), denoting the vector concatenated from all vectors of the form \(v_i\) with index i in the set \(\mathcal {S}\), is used throughout this Appendix

  • \(\textbf{y}^m[n]=(\mathbb {I}\{i\in {\mathcal {N}_m}\}\textbf{y}^m_i[n])_{i\in \mathcal {N}}\): the concatenation of part m of \(\textbf{y}\), which tracks the average (among nodes) gradients of \(\nabla _{\textbf{x}^m}f_i\) from the nodes in \({\mathcal {N}_m}\) in the algorithm

  • \(\textbf{r}^m[n]=(\nabla _{\textbf{x}^m}f^*_{i,n}[n])_{i\in \mathcal {N}}\): the concatenation of ground truth gradient, with \(\nabla _{\textbf{x}^m}f^*_{i,n}[n]=\nabla _{\textbf{x}^m}f^*_{i,n}(\textbf{x}_i[n])\); adding \(\mathbb {I}\{i\in {\mathcal {N}_m}\}\) is unnecessary as the gradient would be zero for those nodes not depending on \(\textbf{x}^m\)

  • \(\varDelta \textbf{r}^m[l,n]=(\varDelta \textbf{r}^m_i[l,n])_{i\in \mathcal {N}}\): the gradient difference, with \(\varDelta \textbf{r}^m_i[l,n]=\nabla _{\textbf{x}^m}f^*_{i,l}[l]-\nabla _{\textbf{x}^m}f^*_{i,n}[n]\) (\(l\le n\))

  • \(\tilde{\mathbf {\pi }}_i[n]\): see Line 11 of Algorithm 1

  • \(\tilde{\textbf{x}}_i[n]=\tilde{\textbf{x}}^*_i(\textbf{x}_i[n],\tilde{\mathbf {\pi }}_i[n])\): see Line 5 of Algorithm 1 and (5)

Average variables

  • \(\bar{\textbf{x}}^m[n]=\frac{1}{I_m}\sum _{i\in \mathcal {N}_m}\textbf{x}^m_i[n]\): average of decision variable

  • \(\bar{\textbf{y}}^m[n]=\frac{1}{I_m}\sum _{i\in \mathcal {N}_m}\textbf{y}^m_i[n]\): average of the gradient tracking variable

  • \(\bar{\textbf{r}}^m[n]=\frac{1}{I_m}\sum _{i\in \mathcal {N}_m}\nabla _{\textbf{x}^m}f^*_{i,n}[n]\): average of ground truth gradient

  • \(\varDelta \bar{\textbf{r}}^m[l,n]=\frac{1}{I_m}\sum _{i\in \mathcal {N}_m}\varDelta \textbf{r}^m_i[l,n]\): average of gradient difference

Tracking system using average variables

  • \(\nabla f^{*,av}_{i,n}[n]=\nabla f^*_{i,n}(\bar{\textbf{x}}[n])\)

  • \(\textbf{r}^{m,av}[n]=(\nabla _{\textbf{x}^m}f^{*,av}_{i,n}[n])_{i\in \mathcal {N}}\): the concatenation of ground truth gradient evaluated at average decision variable

  • \(\varDelta \textbf{r}^{m,av}[l,n]=(\varDelta \textbf{r}^{m,av}_i[l,n])_{i\in \mathcal {N}}\): the gradient difference evaluated at average decision variable, with \(\varDelta \textbf{r}^{m,av}_i[l,n]=\nabla _{\textbf{x}^m}f^{*,av}_{i,n}[l]-\nabla _{\textbf{x}^m}f^{*,av}_{i,n}[n]\)

  • \(\textbf{y}^{m,av}_i[n+1]=\sum _jw^m_{ij}[n]\textbf{y}^{m,av}_j[n]+\varDelta \textbf{r}^{m,av}_i[n+1,n]\): tracking of average gradient evaluated at average decision variable, with \(\textbf{y}^{m,av}_i[0]=\nabla _{\textbf{x}^m}f^{*,av}_{i,n}[0]\); concatenating \(\textbf{y}^{m,av}_i[n+1]\) for \(i\in \mathcal {N}\) makes \(\textbf{y}^{m,av}[n]=(\mathbb {I}\{i\in {\mathcal {N}_m}\}\textbf{y}^{m,av}_i[n])_{i\in \mathcal {N}}\)

  • \(\tilde{\mathbf {\pi }}^{m,av}_i[n]=I_m\textbf{y}^{m,av}_i[n]-\nabla _{\textbf{x}^m}f^{*,av}_{i,n}[n]\)

  • \(\tilde{\textbf{x}}^{av}_i[n]=\tilde{\textbf{x}}^*_i(\bar{\textbf{x}}_i[n],\tilde{\mathbf {\pi }}^{av}_i[n])\): optimized result evaluated at average decision variable and average tracking system

  • \(\bar{\textbf{r}}^{m,av}[n]=\frac{1}{I_m}\sum _{i\in \mathcal {N}_m}\nabla _{\textbf{x}^m}f^{*,av}_{i,n}[n]\): average of ground truth gradient evaluated at average decision variable

Doubly stochastic matrices

  • \(\textbf{P}^m[n,l]=\textbf{W}^m[n]\textbf{W}^m[n-1]\cdots \textbf{W}^m[l]\quad n\ge l\)

  • \(\hat{\textbf{W}}^m[n]=\textbf{W}^m[n]\otimes I_{d_m}\)

  • \(\hat{\textbf{P}}^m[n,l]=\hat{\textbf{W}}^m[n]\hat{\textbf{W}}^m[n-1]\cdots \hat{\textbf{W}}^m[l]=\textbf{P}^m[n,l]\otimes I_{d_m}\quad n\ge l\)

  • \(J^m=\frac{1}{I_m}\textbf{1}_{\mathcal {N}_m}\textbf{1}_{\mathcal {N}_m}^\top \otimes \textbf{I}_{d_m}\), where \(\textbf{1}_{\mathcal {N}_m}=(\mathbb {I}\{i\in {\mathcal {N}_m}\})_{i\in \mathcal {N}}\), and \(\textbf{I}\) is the identity matrix

  • \(J^m_\perp =\textbf{I}_{d_mI_m}-J^m\), where \(\textbf{I}_{d_mI_m}\) is the identity matrix of dimension \(d_mI_m\)
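The roles of \(J^m\) and \(J^m_\perp \) as averaging and consensus-error projectors can be checked directly; a minimal Python sketch with hypothetical dimensions \(I_m=3\) and \(d_m=2\) (our illustration only):

```python
import numpy as np

I_m, d_m = 3, 2                       # hypothetical: 3 nodes, 2-dim block
ones = np.ones((I_m, 1))
J = np.kron(ones @ ones.T / I_m, np.eye(d_m))   # averaging projector J^m
J_perp = np.eye(d_m * I_m) - J                  # consensus-error projector

x = np.random.randn(I_m * d_m)        # stacked local copies x^m
x_bar = x.reshape(I_m, d_m).mean(axis=0)        # network average

# J x stacks the average at every node; J_perp x is the disagreement.
assert np.allclose(J @ x, np.tile(x_bar, I_m))
assert np.allclose(J_perp @ x, x - np.tile(x_bar, I_m))
assert np.allclose(J_perp @ J_perp, J_perp)     # projector: idempotent
```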

B.2 Key Propositions

The next result is a variant of Proposition 5 in [19].

Proposition B.1

Let \(\mathbf {\pi }^{\mathcal {S}_i}_i(\tilde{\textbf{x}})=\sum _{j\ne i}\nabla _{\textbf{x}^{\mathcal {S}_i}}f^*_{j,n}(\tilde{\textbf{x}})=\left( \sum _{j\in {\mathcal {N}_m},j\ne i}\nabla _{\textbf{x}^m}f^*_{j,n}(\tilde{\textbf{x}}^{\mathcal {N}_m})\right) _{m\in {\mathcal {S}_i}}\) be the concatenation of \(\sum _{j\in {\mathcal {N}_m},j\ne i}\nabla _{\textbf{x}^m}f^*_{j,n}(\tilde{\textbf{x}}^{\mathcal {N}_m})\) for all m in \(\mathcal {S}_i\). Define the mapping \(\hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\cdot ):\mathcal {K}\rightarrow \mathcal {K}_{\mathcal {S}_i}=\varPi _{m\in {\mathcal {S}_i}}\mathcal {K}_m\) by

$$\begin{aligned} \hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\tilde{\textbf{x}})=\underset{\textbf{x}^{\mathcal {S}_i}}{\arg \min }\hspace{5.0pt}\tilde{f}^*_{i,n}(\textbf{x}^{\mathcal {S}_i};\tilde{\textbf{x}}^{\mathcal {S}_i})+\mathbf {\pi }^{\mathcal {S}_i}_i(\tilde{\textbf{x}})^\top (\textbf{x}^{\mathcal {S}_i}-\tilde{\textbf{x}}^{\mathcal {S}_i})+G(\textbf{x}^c)\quad \forall \hspace{5.0pt}i\in \mathcal {N}, \end{aligned}$$
(8)

and the mapping \(\hat{\textbf{x}}_{i,n}(\cdot ):\mathcal {K}\rightarrow \mathcal {K}\) by \(\hat{\textbf{x}}_{i,n}(\tilde{\textbf{x}})=\left( \hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\tilde{\textbf{x}}),\tilde{\textbf{x}}^{\mathcal {N}\setminus {\mathcal {S}_i}}\right) \), that is, preserving everything in the \(\mathcal {K}_{\mathcal {N}\setminus {\mathcal {S}_i}}\) subspace while mapping with \(\hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\cdot )\) in the \({\mathcal {S}_i}\) subspace. Then, under Assumptions A, F, and N, the mapping \(\hat{\textbf{x}}_{i,n}(\cdot )\) has the following properties:

  1. (a)

    \(\forall \hspace{5.0pt}\textbf{z}\in \mathcal {K}\) and \(i\in \mathcal {N}\),

    $$\begin{aligned} (\hat{\textbf{x}}_{i,n}(\textbf{z})-\textbf{z})^\top \nabla F(\textbf{z})+G(\hat{\textbf{x}}_{i,n}(\textbf{z}))-G(\textbf{z})\le -\tau ^{\min }_n\Vert \hat{\textbf{x}}_{i,n}(\textbf{z})-\textbf{z}\Vert ^2, \end{aligned}$$

    where \(F(\textbf{x})=\sum _{i=1}^If_i(\textbf{x})\). Here we use \(G(\textbf{x})\) and \(G(\textbf{x}^c)\) interchangeably as they are identical.

  2. (b)

    \(\hat{\textbf{x}}_{i,n}(\cdot )\) is Lipschitz continuous, i.e. \(\Vert \hat{\textbf{x}}_{i,n}(\textbf{w})-\hat{\textbf{x}}_{i,n}(\textbf{z})\Vert \le L_{i,n}\Vert \textbf{w}-\textbf{z}\Vert \quad \forall \hspace{5.0pt}\textbf{w},\textbf{z}\in \mathcal {K}\) for \(i\in \mathcal {N}\).Footnote 4

The only change relative to [19] is the substitution of \(\tilde{f}^*_{i,n}\) for \(\tilde{f}_i\). Although the equations are written in localized form, this does not change anything here.

Lemma B.1

Define \(\textbf{P}[n,l]\triangleq \textbf{W}[n]\textbf{W}[n-1]\cdots \textbf{W}[l]\). Then under Assumption L\(^\prime \), for some \(c_0>0\) and \(\rho \in (0,1)\),

$$\begin{aligned} \left\| \textbf{P}[n,l]-\frac{1}{I}\textbf{1}\textbf{1}^\top \right\| \le c_0\rho ^{n-l+1},\hspace{5.0pt}\forall \hspace{5.0pt}n\ge l. \end{aligned}$$

Strictly speaking, the above is for \(\textbf{W}^c=\textbf{W}^{M+1}\), the matrix used for averaging over the entire network, in Assumption L\(^\prime \). For \(\textbf{W}^m\) with \(m\in [M]\), the Lemma also holds after we delete the zero rows/columns. In this case, the product converges to \(\frac{1}{I_m}\textbf{1}\textbf{1}^\top \), where \(\textbf{1}\) is of the proper dimension. From now on, we take \(\rho \) to be the largest geometric convergence factor among all \(\textbf{W}^m\), \(m\in [M+1]\).
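Lemma B.1 is also easy to verify numerically. A minimal Python sketch with hypothetical 5-node topologies that alternate over time (the Metropolis construction is our assumption for the illustration, not part of the paper):

```python
import numpy as np

def metropolis(I, edges):
    """Doubly stochastic Metropolis weights for an undirected edge list."""
    deg = np.zeros(I)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((I, I))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    return W + np.diag(1 - W.sum(axis=1))

I = 5
W_even = metropolis(I, [(0, 1), (1, 2), (2, 3), (3, 4)])  # path graph
W_odd = metropolis(I, [(0, 2), (1, 3), (2, 4)])           # another topology
J = np.ones((I, I)) / I

P = np.eye(I)
for n in range(30):
    P = (W_even if n % 2 == 0 else W_odd) @ P   # P[n,0]
    if n % 5 == 4:
        print(n, np.linalg.norm(P - J, 2))      # decays like c0 * rho^n
```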

Before proving Theorem 3, we will first prove the following proposition.

Proposition B.2

Let \(\{\textbf{x}^m[n]\}_n\triangleq \{(\textbf{x}^m_i[n])_{i\in \mathcal {N}_m}\}\) and \(\{\bar{\textbf{x}}^m[n]\}_n\triangleq \left\{ \frac{1}{I_m}\sum _{i\in \mathcal {N}_m}\textbf{x}^m_i[n]\right\} _n\), \(m\in [M+1]\), be the sequences generated by Algorithm 1 in the setting of Theorem 3. Then the following hold:

  1. (a)

    For all n, \(\Vert \textbf{x}^{m,inx}_i[n]-\textbf{x}^m_i[n]\Vert \le \frac{c^mL_{i,n}}{\tau _{i,n}}\quad \forall \hspace{5.0pt}i\in \mathcal {N}_m,m\in [M+1]\).

  2. (b)

    \(\lim _{n\rightarrow \infty }\Vert \textbf{x}^m_i[n]-\bar{\textbf{x}}^m[n]\Vert =0\), \(\sum _{n=1}^\infty \alpha [n]\Vert \textbf{x}^m_i[n]-\bar{\textbf{x}}^m[n]\Vert <\infty \), \(\sum _{n=1}^\infty \Vert \textbf{x}^m_i[n]-\bar{\textbf{x}}^m[n]\Vert ^2<\infty \quad \forall \hspace{5.0pt}i\in \mathcal {N}_m,m\in [M+1]\).

  3. (c)

    \(\lim _{n\rightarrow \infty }\Vert \tilde{\textbf{x}}^{av}_i[n]-\hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\bar{\textbf{x}}[n])\Vert =0\), \(\sum _{n=1}^\infty \alpha [n]L^{\max }_n\Vert \tilde{\textbf{x}}^{av}_i[n]-\hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\bar{\textbf{x}}[n])\Vert \) \(<\infty \quad \forall \hspace{5.0pt}i\in \mathcal {N}\).

  4. (d)

    \(\lim _{n\rightarrow \infty }\Vert \tilde{\textbf{x}}^m_i[n]-\tilde{\textbf{x}}^{m,av}_i[n]\Vert =0\), \(\sum _{n=1}^\infty \alpha [n]L^{\max }_n\Vert \tilde{\textbf{x}}^m_i[n]-\tilde{\textbf{x}}^{m,av}_i[n]\Vert <\infty \quad \forall \hspace{5.0pt}i\in \mathcal {N}_m,m\in [M+1]\).

We will use the following assumptions frequently when proving Proposition B.2. These are the exact technical inequalities used in the proof; all of them are implied by the conditions of Theorem 3, as we show below.

Technical Assumption T

(T1):

\(\lim _{n\rightarrow \infty }\rho ^n\frac{L^{\max }_n}{\tau ^{\min }_n}=0\);

(T2):

\(\lim _{n\rightarrow \infty }\frac{L^{\max }_n}{\tau ^{\min }_n}\sum _{l=0}^{n-1}\rho ^{n-l}\frac{\alpha [l](L^{\max }_l)^2}{\tau ^{\min }_l}=0\);

(T3):

\(\lim _{n\rightarrow \infty }\frac{1}{\tau ^{\min }_n}\sum _{l=0}^{n-1}\rho ^{n-l}\Vert \nabla f^*_{i,l}(\textbf{x})-\nabla f^*_{i,l-1}(\textbf{x})\Vert =0\) for all \(\textbf{x}\) and i;

(T4):

\(\sum _{n=1}^\infty \rho ^n\frac{L^{\max }_n}{\tau ^{\min }_n}<\infty \);

(T5):

\(\sum _{n=1}^\infty \frac{\alpha [n]L^{\max }_n}{\tau ^{\min }_n}\sum _{l=0}^{n-1}\rho ^{n-l}\frac{\alpha [l](L^{\max }_l)^2}{\tau ^{\min }_l}<\infty \);

(T6):

\(\sum _{n=1}^\infty \frac{\alpha [n]L^{\max }_n}{\tau ^{\min }_n}\sum _{l=0}^{n-1}\rho ^{n-l}\Vert \nabla f^*_{i,l}(\textbf{x})-\nabla f^*_{i,l-1}(\textbf{x})\Vert <\infty \) for all \(\textbf{x}\) and i.

Proof

(T1)::

it should be evident from the conditions of Theorem 3 that none of the parameters can grow or decay at an exponential rate. We consider the setting where \(\alpha [n]\) and \(\tau ^{\min }_n\) go to zero while \(L^{\max }_n\) goes to infinity. The condition \(\sum _{n=0}^\infty (L^{\max }_n)^3\left( \frac{\alpha [n]}{\tau ^{\min }_n}\right) ^2<\infty \) implies that if either \(L^{\max }_n\) grows exponentially or \(\tau ^{\min }_n\) decays exponentially, then \(\alpha [n]\) must also decay exponentially. But then \(\sum _{n=0}^\infty \tau ^{\min }_n\alpha [n]=\infty \) could not hold.

(T2)::

recall the conditions of Theorem 3 imply \(\lim _{n\rightarrow \infty }\alpha [n]\frac{(L^{\max }_n)^3}{(\tau ^{\min }_n)^3}=0\), then apply the first part of Lemma C.1.

(T3)::

from \(\lim _{n\rightarrow \infty }\frac{\eta ^{\max }_n}{\tau ^{\min }_n}=0\) and the first part of Lemma C.1.

(T4)::

again the parameters are not growing at an exponential rate.

(T5)::

from \(\sum _{n=0}^\infty (L^{\max }_n)^3\left( \frac{\alpha [n]}{\tau ^{\min }_n}\right) ^2<\infty \) and the second part of Lemma C.1.

(T6)::

from \(\sum _{n=0}^\infty \frac{\alpha [n]L^{\max }_n\eta ^{\max }_n}{\tau ^{\min }_n}<\infty \) and the second part of Lemma C.1.\(\square \)
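As an illustration (our example, not from the paper), the hypothetical polynomial schedules \(\alpha [n]=(n+1)^{-0.8}\), \(L^{\max }_n=(n+1)^{0.05}\), and \(\tau ^{\min }_n=(n+1)^{-0.05}\) are compatible with the behavior described above: \(\sum _n\tau ^{\min }_n\alpha [n]\) diverges (exponent \(-0.85\ge -1\)), \(\sum _n(L^{\max }_n)^3(\alpha [n]/\tau ^{\min }_n)^2\) converges (exponent \(-1.35<-1\)), and \(\alpha [n](L^{\max }_n/\tau ^{\min }_n)^3=(n+1)^{-0.5}\rightarrow 0\); since nothing varies exponentially, conditions such as (T1) and (T4) hold as well. A minimal numerical sanity check in Python:

```python
import numpy as np

n = np.arange(1, 200001, dtype=float)
alpha, L_max, tau_min = n**-0.8, n**0.05, n**-0.05

# Diverging: sum tau_min * alpha ~ sum n^{-0.85} -> infinity (slowly).
print("sum tau*alpha     :", (tau_min * alpha).sum())
# Converging: sum L^3 * (alpha/tau)^2 ~ sum n^{-1.35} < infinity.
print("sum L^3*(a/t)^2   :", (L_max**3 * (alpha / tau_min)**2).sum())
# Vanishing: alpha * (L/tau)^3 ~ n^{-0.5} -> 0.
print("alpha*(L/t)^3 tail:", (alpha * (L_max / tau_min)**3)[-1])
```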

B.3 Proof of Proposition B.2 (a)

Consider a local dependency set \({\mathcal {N}_m}\) and any node \(i\in {\mathcal {N}_m}\). By the optimality of \(\tilde{\textbf{x}}^{\mathcal {S}_i}_i\) in the minimization (5), we have

$$\begin{aligned} \begin{aligned}&\sum _{m\in {\mathcal {S}_i}}(\textbf{x}^m_i[n]-\tilde{\textbf{x}}^m_i[n])^\top \left[ \nabla _{\textbf{x}^m_i}\tilde{f}^*_{i,n} (\tilde{\textbf{x}}^{\mathcal {S}_i}_i[n];\textbf{x}^{\mathcal {S}_i}_i[n])+\tilde{\mathbf {\pi }}^m_i[n]\right] \\&\quad +(\textbf{x}^c_i[n]-\tilde{\textbf{x}}^c_i[n])^\top \partial G(\tilde{\textbf{x}}^c_i[n])\ge 0. \end{aligned} \end{aligned}$$
(9)

From Line 11 of Algorithm 1 and (F2\(^\prime \)), we have

$$\begin{aligned} \tilde{\mathbf {\pi }}^m_i[n]=I_m\cdot \textbf{y}^m_i[n]-\nabla _{\textbf{x}^m_i}\tilde{f}^*_{i,n}(\textbf{x}^{\mathcal {S}_i}_i[n];\textbf{x}^{\mathcal {S}_i}_i[n]). \end{aligned}$$
(10)

Substituting this result into (9) and using the strong convexity of \(\tilde{f}^*_{i,n}\), the Cauchy–Schwarz inequality, and (A2), we get

$$\begin{aligned} \begin{aligned} \tau _{i,n}\Vert \textbf{x}^{\mathcal {S}_i}_i-\tilde{\textbf{x}}^{\mathcal {S}_i}_i\Vert ^2&=\tau _{i,n}\sum _{m\in {\mathcal {S}_i}}\Vert \textbf{x}^m_i -\tilde{\textbf{x}}^m_i\Vert ^2\\&\le \sum _{m\in {\mathcal {S}_i}}(I_m\Vert \textbf{y}^m_i\Vert )\cdot \Vert \textbf{x}^m_i-\tilde{\textbf{x}}^m_i\Vert +L_G\Vert \textbf{x}^c_i-\tilde{\textbf{x}}^c_i\Vert . \end{aligned} \end{aligned}$$
(11)

We have omitted all the time indices in (11) since all the variables have the same time index [n].

Suppose that \(\Vert \textbf{y}^m_i\Vert \) is bounded by \(l_mL_{i,n}\) for all \(m\in {\mathcal {S}_i}\). Then (11) is of the form

$$\begin{aligned} \tau _{i,n}\sum _{m\in {\mathcal {S}_i}}\Vert \textbf{x}^m_i-\tilde{\textbf{x}}^m_i\Vert ^2\le \sum _{m\in {\mathcal {S}_i}}l_mL_{i,n}\Vert \textbf{x}^m_i-\tilde{\textbf{x}}^m_i\Vert , \end{aligned}$$
(12)

which implies that all \(\Vert \textbf{x}^m_i-\tilde{\textbf{x}}^m_i\Vert \)'s are bounded by \(\frac{\sum _{m\in {\mathcal {S}_i}}l_mL_{i,n}}{\tau _{i,n}}\).Footnote 5 This is due to the following argument: if \(\{x_i\},\{l_i\}\) are non-negative and \(\sum _ix^2_i\le \sum _il_ix_i\), then \(\max \{x_i\}\le \sum _il_i\); otherwise, W.L.O.G. assume \(x_1=\max \{x_i\}\) and hence \(\sum _il_i<x_1\); then \(\sum _ix^2_i\ge x^2_1>x_1\sum _il_i\ge \sum _il_ix_i\), which is a contradiction. Thus, with (12), we get

$$\begin{aligned} \Vert \textbf{x}^{m,inx}_i[n]-\textbf{x}^m_i[n]\Vert \le \epsilon ^m_i[n]+\frac{\sum _{m\in {\mathcal {S}_i}}l_mL_{i,n}}{\tau _{i,n}}\le \frac{c^mL_{i,n}}{\tau _{i,n}}, \end{aligned}$$

where \(c^m\) is some constant independent of n and i. This proves the claim. It only remains to show that \(\Vert \textbf{y}^m_i\Vert \) is actually bounded by \(l_mL_{i,n}\).
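As a sanity check, the elementary inequality invoked above (non-negative \(x_i,l_i\) with \(\sum _ix^2_i\le \sum _il_ix_i\) imply \(\max _ix_i\le \sum _il_i\)) can be tested numerically; a minimal sketch with random non-negative data:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100000):
    x = rng.uniform(0, 5, size=4)    # non-negative x_i
    l = rng.uniform(0, 5, size=4)    # non-negative l_i
    if (x**2).sum() <= (l * x).sum():        # hypothesis of the lemma
        assert x.max() <= l.sum() + 1e-12    # conclusion never fails
```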

We use mathematical induction to finish the proof. The statement is that

$$\begin{aligned} \Vert \varDelta \textbf{x}^{m,inx}_i[n]\Vert =\Vert \textbf{x}^{m,inx}_i[n]-\textbf{x}^m_i[n]\Vert \le \frac{c^mL_{i,n}}{\tau _{i,n}},\hspace{5.0pt}\Vert \textbf{y}^m_i[n]\Vert \le l_mL_{i,n} \end{aligned}$$
(13)

holds for all n. We have already shown that the latter implies the former. The base case is immediate as we initialize \(\textbf{y}^m_i[0]\) to be \(\nabla _{\textbf{x}^m}f^*_{i,0}[0]\), which is assumed to be Lipschitz continuous. For the induction step, we assume the statement is true for \(n-1\) and prove that the latter part, \(\Vert \textbf{y}^m_i[n]\Vert \le l_mL_{i,n}\), holds for n.

By the definition of \(\textbf{y}\), we have

$$\begin{aligned} \textbf{y}^m_i[n]=\hat{\textbf{W}}^m[n-1]\textbf{y}^m_i[n-1]+\varDelta \textbf{r}^m_i[n,n-1] \end{aligned}$$
(14)

where

$$\begin{aligned} \begin{aligned}&\quad \Vert \varDelta \textbf{r}^m_i[n,n-1]\Vert =\Vert \nabla _{\textbf{x}^m}f^*_{i,n}(\textbf{x}_i[n])-\nabla _{\textbf{x}^m}f^*_{i,n-1}(\textbf{x}_i[n-1])\Vert \\&\le L_{i,n}\Vert \textbf{x}_i[n]-\textbf{x}_i[n-1]\Vert +\Vert \nabla _{\textbf{x}^m}f^*_{i,n}(\textbf{x}_i[n-1])-\nabla _{\textbf{x}^m}f^*_{i,n-1}(\textbf{x}_i[n-1])\Vert . \end{aligned} \end{aligned}$$
(15)

To reach the last line we utilize the triangle inequality and the Lipschitz continuity of \(\nabla f^*_{i,n}\). For the first term, with Fact 11 (c) and (e),

$$\begin{aligned} \Vert \textbf{x}^m_i[n]-\textbf{x}^m_i[n-1]\Vert \le c_1\alpha [n-1]\Vert \varDelta \textbf{x}^{m,inx}[n-1]\Vert . \end{aligned}$$
(16)

Using the induction hypothesis of \(\varDelta \textbf{x}\), we obtain

$$\begin{aligned} \begin{aligned}&\quad \Vert \varDelta \textbf{r}^m_i[n,n-1]\Vert \le L_{i,n}\Vert \textbf{x}_i[n]-\textbf{x}_i[n-1]\Vert +c_2\eta _{i,n}\\&\le c_1\alpha [n-1]L_{i,n}\Vert \varDelta \textbf{x}^{m,inx}[n-1]\Vert +c_2\eta _{i,n} \le \frac{c_3\alpha [n-1]L_{i,n}L_{i,n-1}}{\tau _{i,n-1}}+c_2\eta _{i,n}. \end{aligned} \end{aligned}$$

Therefore, we finally obtain

$$\begin{aligned} \Vert \textbf{y}^m_i[n]\Vert \le L_{i,n}\left( c_4l_m+\frac{c_3\alpha [n-1]L_{i,n-1}}{\tau _{i,n-1}}+\frac{c_2\eta _{i,n}}{L_{i,n}}\right) \le c_5L_{i,n} \end{aligned}$$

using the induction hypothesis on \(\textbf{y}\) and the fact that \(\frac{\alpha [n-1]L_{i,n-1}}{\tau _{i,n-1}}\) also goes to zero as \(n\rightarrow \infty \), as implied by the conditions of Theorem 3.

B.4 Proof of Proposition B.2 (b)

We only prove the case \(\textbf{x}^c=\textbf{x}^{M+1}\) to avoid the ubiquitous superscript m. The proof of the claims for general \(\textbf{x}^m\) is exactly the same after the appropriate substitutions of \(\textbf{x}^c\), \(\hat{\textbf{W}}\), \(\textbf{P}\), \(\textbf{r}^c\), J, \(J_\perp \), \(\textbf{1}_I\), and I by \(\textbf{x}^m\), \(\hat{\textbf{W}}^m\), \(\textbf{P}^m\), \(\textbf{r}^m\), \(J^m\), \(J^m_\perp \), \(\textbf{1}_{\mathcal {N}_m}\), and \(I_m\).

  1. (i)

    \( \textbf{x}^c[n]-\textbf{1}_I\otimes \bar{\textbf{x}}^c[n]=\textbf{x}^c[n]-J\textbf{x}^c[n]=J_\perp \textbf{x}^c[n]\triangleq \textbf{x}^c_\perp [n].\) Notice that, by Fact 11 (d) and (e), the consensus error \(\textbf{x}^c_\perp [n]\) can be expressed as a linear combination of \(\textbf{x}^c_\perp [n-1]\) and \(\varDelta \textbf{x}^{c,inx}[n-1]\). We can thus expand \(\textbf{x}^c_\perp [n]\) iteratively:

    $$\begin{aligned} \begin{aligned} \textbf{x}^c_\perp [n]&=\left[ \left( \textbf{P}[n-1,0]-\frac{1}{I}\textbf{1}_I\textbf{1}_I^\top \right) \otimes \textbf{I}_{d_{M+1}}\right] \textbf{x}^c_\perp [0]\\&\quad +\sum _{l=0}^{n-1}\left[ \left( \textbf{P}[n-1,l]-\frac{1}{I}\textbf{1}_I\textbf{1}_I^\top \right) \otimes \textbf{I}_{d_{M+1}}\right] \alpha [l]\varDelta \textbf{x}^{c,inx}[l] \end{aligned} \end{aligned}$$
    (17)

    from Fact 11 (b). From Proposition B.2 (a) we know

    $$\begin{aligned} \Vert \varDelta \textbf{x}^{c,inx}[n]\Vert \le \Vert \varDelta \textbf{x}^{inx}[n]\Vert =\sqrt{\sum _{i=1}^I\Vert \varDelta \textbf{x}^{inx}_i[n]\Vert ^2}\le c_1\max _{i}\Vert \varDelta \textbf{x}^{inx}_i[n]\Vert \le \frac{c_2L^{\max }_n}{\tau ^{\min }_n} \end{aligned}$$
    (18)

    for some constants \(c_1\) and \(c_2\). Consequently, we get

    $$\begin{aligned} \Vert \textbf{x}^c_\perp [n]\Vert \le c_3\rho ^n+c_4\sum _{l=0}^{n-1}\rho ^{n-l}\frac{\alpha [l]L^{\max }_l}{\tau ^{\min }_l}\xrightarrow {n\rightarrow \infty }0 \end{aligned}$$
    (19)

    by first utilizing the triangle inequality, and then using (18), Lemma B.1, and finally [19] Lemma 7 (a). Remark that \(\lim _{n\rightarrow \infty }\alpha [n]\left( \frac{L^{\max }_n}{\tau ^{\min }_n}\right) ^3=0\) implies \(\lim _{n\rightarrow \infty }\frac{\alpha [n]L^{\max }_n}{\tau ^{\min }_n}=0\), which we use in (19) as the condition of [19] Lemma 7 (a).

  2. (ii)

    \(\lim _{n\rightarrow \infty }\sum _{k=1}^n\alpha [k]\Vert \textbf{x}^c_i[k]-\bar{\textbf{x}}^c[k]\Vert \le \lim _{n\rightarrow \infty }\sum _{k=1}^n\alpha [k]\left( c_3\rho ^k+c_4\sum _{l=0}^{k-1}\rho ^{k-l}\frac{\alpha [l]L^{\max }_l}{\tau ^{\min }_l}\right) \) \(\quad \le \lim _{n\rightarrow \infty }\left( c_3\sum _{k=1}^n\rho ^k\alpha [k]+c_4\rho \sum _{k=1}^n\sum _{l=1}^k\rho ^{k-l}\alpha [k]\frac{\alpha [l-1]L^{\max }_{l-1}}{\tau ^{\min }_{l-1}}\right) <\infty .\) The bound for the last term comes from [19] Lemma 7 (b).

  3. (iii)
    $$\begin{aligned} \begin{aligned} \lim _{n\rightarrow \infty }\sum _{k=1}^n\Vert \textbf{x}^c_\perp [k]\Vert ^2&\le \lim _{n\rightarrow \infty }\left( c_3^2\sum _{k=1}^n\rho ^{2k}+2c_3c_4\rho \sum _{k=1}^n\sum _{l=1}^k\rho ^{2k-l}\frac{\alpha [l-1]L^{\max }_{l-1}}{\tau ^{\min }_{l-1}}\right. \\&\left. \qquad +c_4^2\rho ^2\sum _{k=1}^n\sum _{l=1}^k\sum _{t=1}^k\rho ^{2k-l-t}\frac{\alpha [l-1]L^{\max }_{l-1}}{\tau ^{\min }_{l-1}}\frac{\alpha [t-1]L^{\max }_{t-1}}{\tau ^{\min }_{t-1}}\right) <\infty . \end{aligned} \end{aligned}$$

    The double summation is bounded due to the second equality of [19] Lemma 7 (b) with \((\lambda ,\beta [k],\nu [l])\) being \((\rho ,\rho ^k,\frac{\alpha [l-1]L^{\max }_{l-1}}{\tau ^{\min }_{l-1}})\). The condition \(\sum _{n=1}^\infty (L^{\max }_n)^3\left( \frac{\alpha [n]}{\tau ^{\min }_n}\right) ^2<\infty \) of Theorem 3 guarantees that \(\sum _{n=1}^\infty \left( \frac{\alpha [n]L^{\max }_n}{\tau ^{\min }_n}\right) ^2<\infty \). The triple summation term is less than

    $$\begin{aligned}\begin{aligned} \lim _{n\rightarrow \infty }\sum _{k=1}^n\sum _{l=1}^k\sum _{t=1}^k\rho ^{k-l}\cdot \rho ^{k-t}\cdot \frac{\left( \frac{\alpha [l-1]L^{\max }_{l-1}}{\tau ^{\min }_{l-1}}\right) ^2+\left( \frac{\alpha [t-1]L^{\max }_{t-1}}{\tau ^{\min }_{t-1}}\right) ^2}{2}<\infty \end{aligned}\end{aligned}$$

    due to the first equality of [19] Lemma 7 (b). Again, the convergence of \(\sum _{n=1}^\infty \left( \frac{\alpha [n]L^{\max }_n}{\tau ^{\min }_n}\right) ^2\) is implied by the convergence of \(\sum _{n=1}^\infty (L^{\max }_n)^3\left( \frac{\alpha [n]}{\tau ^{\min }_n}\right) ^2\).

B.5 Proof of Proposition B.2 (c)

We exploit the optimality of \(\tilde{\textbf{x}}^{av}_i\) together with (F1\(^\prime \)) and (A2) to get

$$\begin{aligned} \left[ \hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\bar{\textbf{x}})-\tilde{\textbf{x}}^{av}_i\right] ^\top \left[ \nabla _{\textbf{x}^{\mathcal {S}_i}}\tilde{f}^*_{i,n}(\tilde{\textbf{x}}^{av}_i;\bar{\textbf{x}}^{\mathcal {S}_i})+\tilde{\mathbf {\pi }}^{av}_i+(\textbf{0}^{{\mathcal {S}_i}\setminus \{c\}},\partial G(\tilde{\textbf{x}}^{c,av}_i))\right] \ge 0; \end{aligned}$$
(20)

and the optimality of \(\hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\bar{\textbf{x}})\) (the mapping evaluated at \(\bar{\textbf{x}}\)) leads to

$$\begin{aligned} \left[ \tilde{\textbf{x}}^{av}_i-\hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\bar{\textbf{x}})\right] ^\top \left[ \nabla _{\textbf{x}^{\mathcal {S}_i}}\tilde{f}^*_{i,n}(\hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\bar{\textbf{x}});\bar{\textbf{x}}^{\mathcal {S}_i})+\mathbf {\pi }^{\mathcal {S}_i}_i(\bar{\textbf{x}})+(\textbf{0}^{{\mathcal {S}_i}\setminus \{c\}},\partial G(\hat{\textbf{x}}^c_{i,n}(\bar{\textbf{x}})))\right] \ge 0. \end{aligned}$$
(21)

Here \(\textbf{0}^{{\mathcal {S}_i}\setminus \{c\}}\) is the all-zero vector in the subspace \(\mathcal {K}_{{\mathcal {S}_i}\setminus \{c\}}\), and \(\hat{\textbf{x}}^c_{i,n}(\bar{\textbf{x}})\) refers to the component of \(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}})\) in the subspace \(\mathcal {K}_c\). Then from (F1\(^\prime \)), (A2), (20), (21), and the Cauchy–Schwarz inequality,

$$\begin{aligned} \tau _{i,n}\left\| \hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\bar{\textbf{x}})-\tilde{\textbf{x}}^{av}_i\right\| ^2 \le \left\| \hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\bar{\textbf{x}})-\tilde{\textbf{x}}^{av}_i\right\| \cdot \left\| \tilde{\mathbf {\pi }}^{av}_i-\mathbf {\pi }^{\mathcal {S}_i}_i(\bar{\textbf{x}})\right\| . \end{aligned}$$
(22)

From (22),

$$\begin{aligned} \left\| \hat{\textbf{x}}^{\mathcal {S}_i}_{i,n}(\bar{\textbf{x}})-\tilde{\textbf{x}}^{av}_i\right\| \le \frac{1}{\tau _{i,n}} \left\| \tilde{\mathbf {\pi }}^{av}_i-\mathbf {\pi }^{\mathcal {S}_i}_i(\bar{\textbf{x}})\right\| \le \frac{1}{\tau _{i,n}}\sum _{m\in {\mathcal {S}_i}}I_m\Vert \textbf{y}^{m,av}-\textbf{1}_{\mathcal {N}_m}\otimes \bar{\textbf{r}}^{m,av}\Vert . \end{aligned}$$

From now on, the context is clear enough to allow us to drop all [n] time indices. Again, we focus only on the case \(m=M+1\); that is, we prove that \(\frac{1}{\tau _{i,n}}\Vert \textbf{y}^{c,av}-\textbf{1}_I\otimes \bar{\textbf{r}}^{c,av}\Vert \) goes to zero. We iteratively expand \(\textbf{y}^{c,av}[n]\) to get

$$\begin{aligned} \textbf{y}^{c,av}[n]=\hat{\textbf{P}}[n-1,0]\textbf{r}^{c,av}[0]+\sum _{l=1}^{n-1}\hat{\textbf{P}}[n-1,l]\varDelta \textbf{r}^{c,av}[l,l-1]+\varDelta \textbf{r}^{c,av}[n,n-1]; \end{aligned}$$
(23)

similarly,

$$\begin{aligned} \textbf{1}_I\otimes \bar{\textbf{r}}^{c,av}[n]=J\textbf{r}^{c,av}[0]+\sum _{l=1}^{n-1}J\varDelta \textbf{r}^{c,av}[l,l-1]+J\varDelta \textbf{r}^{c,av}[n,n-1]. \end{aligned}$$
(24)

Similar to (15), we have

$$\begin{aligned} \Vert \varDelta \textbf{r}^{c,av}[l,l-1]\Vert \le IL^{\max }_{l-1}\Vert \bar{\textbf{x}}[l]-\bar{\textbf{x}}[l-1]\Vert +\sum _{i\in \mathcal {N}}\Vert \nabla _{\textbf{x}^c}f^*_{i,l}(\bar{\textbf{x}}[l])-\nabla _{\textbf{x}^c}f^*_{i,l-1}(\bar{\textbf{x}}[l])\Vert . \end{aligned}$$
(25)

By combining (23), (24), Lemma B.1, (25), Proposition B.2 (a), (T1), (T2), and (T3), we have

$$\begin{aligned} \begin{aligned}&\frac{1}{\tau _{i,n}}\Vert \textbf{y}^{c,av}[n]-\textbf{1}_I\otimes \bar{\textbf{r}}^{c,av}[n]\Vert \\&\le c_1\frac{\rho ^n}{\tau ^{\min }_n}+c_2\sum _{l=1}^{n-1}\frac{\rho ^{n-l}}{\tau ^{\min }_n}\sum _i\Vert \nabla _{\textbf{x}^c}f^*_{i,l}(\bar{\textbf{x}}[l])-\nabla _{\textbf{x}^c}f^*_{i,l-1}(\bar{\textbf{x}}[l])\Vert \\&+c_3\frac{1}{\tau ^{\min }_n}\sum _i\Vert \nabla _{\textbf{x}^c}f^*_{i,n}(\bar{\textbf{x}}[n])-\nabla _{\textbf{x}^c}f^*_{i,n-1}(\bar{\textbf{x}}[n])\Vert +c_4\sum _{l=1}^{n-1}\rho ^{n-l}\frac{\alpha [l-1](L^{\max }_{l-1})^2}{\tau ^{\min }_n\tau ^{\min }_{l-1}}\\&+c_5\frac{\alpha [n-1](L^{\max }_{n-1})^2}{\tau ^{\min }_n\tau ^{\min }_{n-1}}\xrightarrow {n\rightarrow \infty }0. \end{aligned} \end{aligned}$$
(26)

In the last equation, we also have \(\lim _{n\rightarrow \infty }\frac{\alpha [n](L^{\max }_{n})^2}{(\tau ^{\min }_{n})^2}=0\) and \(\lim _{n\rightarrow \infty }\frac{1}{\tau ^{\min }_n}\Vert \nabla f^*_{i,n}(\bar{\textbf{x}}[n])-\nabla f^*_{i,n-1}(\bar{\textbf{x}}[n])\Vert =0\) implied by the conditions of Theorem 3. For the second part of the claim, we can equivalently prove \(\sum _{n=1}^\infty \frac{\alpha [n]L^{\max }_n}{\tau ^{\min }_n}\Vert \textbf{y}^{c,av}[n]-\textbf{1}_I\otimes \bar{\textbf{r}}^{c,av}[n]\Vert \) is finite. This is true because the quantity is bounded by

$$\begin{aligned} \begin{aligned}&c_4\sum _{n=1}^\infty \alpha [n]L^{\max }_n\sum _{l=1}^{n-1}\rho ^{n-l}\frac{\alpha [l-1](L^{\max }_{l-1})^2}{\tau ^{\min }_n\tau ^{\min }_{l-1}} +c_5\sum _{n=1}^\infty \frac{\alpha [n]\alpha [n-1]L^{\max }_n(L^{\max }_{n-1})^2}{\tau ^{\min }_n\tau ^{\min }_{n-1}}\\&\qquad +c_2\sum _{n=1}^\infty \alpha [n]L^{\max }_n\sum _{l=1}^{n-1}\frac{\rho ^{n-l}}{\tau ^{\min }_n}\sum _i\Vert \nabla _{\textbf{x}^c}f^*_{i,l}(\bar{\textbf{x}}[l])-\nabla _{\textbf{x}^c}f^*_{i,l-1}(\bar{\textbf{x}}[l])\Vert \\&\qquad +c_3\sum _{n=1}^\infty \frac{\alpha [n]L^{\max }_n}{\tau ^{\min }_n}\sum _i\Vert \nabla _{\textbf{x}^c}f^*_{i,n}(\bar{\textbf{x}}[n])-\nabla _{\textbf{x}^c}f^*_{i,n-1}(\bar{\textbf{x}}[n])\Vert +c_1\sum _{n=1}^\infty \rho ^n\frac{\alpha [n]L^{\max }_n}{\tau ^{\min }_n}. \end{aligned} \end{aligned}$$

All the terms are finite because of the following. The first term is due to (T4): after multiplying by the vanishing \(\alpha [n]\), it remains bounded. The second term is due to (T5). The third term is finite by the conditions of Theorem 3. The fourth term is due to (T6). The last term is also finite by the conditions of Theorem 3.

B.6 Proof of Proposition B.2 (d)

Recall we have

$$\begin{aligned} \tilde{\textbf{x}}_i[n]=\underset{\textbf{x}_i\in \mathcal {K}_{\mathcal {S}_i}}{\arg \min }\hspace{5.0pt}\tilde{U}_{i,n}(\textbf{x}_i;\textbf{x}_i[n],\tilde{\mathbf {\pi }}_i[n]) \quad \text {and}\quad \tilde{\textbf{x}}^{av}_i[n]=\underset{\textbf{x}_i\in \mathcal {K}_{\mathcal {S}_i}}{\arg \min }\hspace{5.0pt}\tilde{U}_{i,n}(\textbf{x}_i;\bar{\textbf{x}}_i[n],\tilde{\mathbf {\pi }}^{av}_i[n]), \end{aligned}$$

where

$$\begin{aligned} \tilde{U}_{i,n}(\textbf{x}_i;\textbf{x}_i[n],\tilde{\mathbf {\pi }}_i[n])=\tilde{f}^*_{i,n}(\textbf{x}_i;\textbf{x}_i[n]) +\sum _{k\in \mathcal {S}_i}\tilde{\mathbf {\pi }}^k_i[n]^\top (\textbf{x}^k_i-\textbf{x}^k_i[n])+G(\textbf{x}^c). \end{aligned}$$

These along with (F1\(^\prime \)) and (A2) lead to the following:

$$\begin{aligned} (\tilde{\textbf{x}}^{av}_i-\tilde{\textbf{x}}_i)^\top \cdot \left[ \nabla _{\textbf{x}^{\mathcal {S}_i}}\tilde{f}^*_{i,n}(\tilde{\textbf{x}}_i;\textbf{x}_i)+\tilde{\mathbf {\pi }}_i+(\textbf{0}^{{\mathcal {S}_i}\setminus \{c\}},\partial G(\tilde{\textbf{x}}^c_i))\right] \ge 0, \end{aligned}$$
(27)
$$\begin{aligned} (\tilde{\textbf{x}}_i-\tilde{\textbf{x}}^{av}_i)^\top \cdot \left[ \nabla _{\textbf{x}^{\mathcal {S}_i}}\tilde{f}^*_{i,n}(\tilde{\textbf{x}}^{av}_i;\bar{\textbf{x}}_i)+\tilde{\mathbf {\pi }}^{av}_i+(\textbf{0}^{{\mathcal {S}_i}\setminus \{c\}},\partial G(\tilde{\textbf{x}}^{c,av}_i))\right] \ge 0. \end{aligned}$$
(28)

As we did in (22), using (A2), (F1\(^\prime \)), (27), (28), and (N1),

$$\begin{aligned} \tau _{i,n}\Vert \tilde{\textbf{x}}_i-\tilde{\textbf{x}}^{av}_i\Vert ^2\le \Vert \tilde{\textbf{x}}_i-\tilde{\textbf{x}}^{av}_i\Vert \cdot \left[ 2L_{i,n}\Vert \bar{\textbf{x}}_i-\textbf{x}_i\Vert +\sum _{m\in {\mathcal {S}_i}}\left\| I_m(\textbf{y}^m_i-\textbf{y}^{m,av}_i)\right\| \right] . \end{aligned}$$

Hence,

$$\begin{aligned} \begin{aligned}&\left[ \sum _{m\in {\mathcal {S}_i}}\left\| \tilde{\textbf{x}}^m_i-\tilde{\textbf{x}}^{m,av}_i\right\| ^2\right] ^{1/2}\le \frac{2L_{i,n}}{\tau _{i,n}}\left( \sum _{m\in {\mathcal {S}_i}}\Vert \bar{\textbf{x}}^m-\textbf{x}^m_i\Vert \right) \\&+\sum _{m\in {\mathcal {S}_i}}\frac{I_m}{\tau _{i,n}}\big (\Vert \textbf{1}_{\mathcal {N}_m}\otimes (\bar{\textbf{r}}^m-\bar{\textbf{r}}^{m,av})\Vert +\Vert \textbf{y}^m-\textbf{y}^{m,av}-\textbf{1}_{\mathcal {N}_m}\otimes (\bar{\textbf{r}}^m-\bar{\textbf{r}}^{m,av})\Vert \big ). \end{aligned} \end{aligned}$$
(29)

Since \(\Vert \tilde{\textbf{x}}^m_i-\tilde{\textbf{x}}^{m,av}_i\Vert \) is not larger than \(\left[ \sum _{m\in {\mathcal {S}_i}}\left\| \tilde{\textbf{x}}^m_i-\tilde{\textbf{x}}^{m,av}_i\right\| ^2\right] ^{1/2}\), (29) implies the former goes to zero as n goes to infinity if we can show that all terms on the RHS do so. The first term does go to zero, as we showed in part (b) (combining (19), [19] Lemma 7 (a), and the fact that \(\lim _{n\rightarrow \infty }\alpha [n]\left( \frac{L^{\max }_n}{\tau ^{\min }_n}\right) ^2=0\)). The following shows this property holds for the remaining two terms as well. As always, we omit all time indices [n] as the context is clear enough.

Using (N1) and Proposition B.2 (b), we have

$$\begin{aligned} \frac{1}{\tau _{i,n}}\Vert \textbf{1}_{\mathcal {N}_m}\otimes (\bar{\textbf{r}}^m[n]-\bar{\textbf{r}}^{m,av}[n])\Vert \le \sum _{i\in {\mathcal {N}_m}}\frac{L_{i,n}}{\tau _{i,n}}\Vert \textbf{x}_i[n]-\bar{\textbf{x}}^{\mathcal {S}_i}[n]\Vert \xrightarrow {n\rightarrow \infty }0; \end{aligned}$$
(30)

moreover, using (23), (24), (26), (N1), (T1), (T2), Proposition B.2 (b), [19] Lemma 7 (a), and (T3),

$$\begin{aligned}&\frac{1}{\tau _{i,n}}\Vert \textbf{y}^m-\textbf{y}^{m,av}-\textbf{1}_{\mathcal {N}_m}\otimes (\bar{\textbf{r}}^m-\bar{\textbf{r}}^{m,av})\Vert \\&\quad \le c_1\frac{\rho ^n}{\tau ^{\min }_n}+c_4\sum _{l=1}^{n-1}\frac{\rho ^{n-l}}{\tau ^{\min }_n}\sum _{i\in {\mathcal {N}_m}}\Bigl (L^{\max }_l\Vert \textbf{x}^m_i[l]-\bar{\textbf{x}}^m[l]\Vert +L^{\max }_{l-1}\Vert \textbf{x}^m_i[l-1]-\bar{\textbf{x}}^m[l-1]\Vert \\&\quad \quad +\Vert \nabla _{\textbf{x}^m}f^*_{i,l}(\textbf{x}_i[l-1])-\nabla _{\textbf{x}^m}f^*_{i,l-1}(\textbf{x}_i[l-1])\Vert \\&\quad \quad +\Vert \nabla _{\textbf{x}^m}f^*_{i,l}(\bar{\textbf{x}}[l-1])-\nabla _{\textbf{x}^m}f^*_{i,l-1}(\bar{\textbf{x}}[l-1])\Vert \Bigr )\\&\quad \quad +c_5\frac{1}{\tau ^{\min }_n}\sum _{i\in {\mathcal {N}_m}}\Bigl (L^{\max }_n\Vert \textbf{x}^m_i[n]-\bar{\textbf{x}}^m[n]\Vert +L^{\max }_{n-1}\Vert \textbf{x}^m_i[n-1]-\bar{\textbf{x}}^m[n-1]\Vert \\&\quad \quad +\Vert \nabla _{\textbf{x}^m}f^*_{i,n}(\textbf{x}_i[n-1])-\nabla _{\textbf{x}^m}f^*_{i,n-1}(\textbf{x}_i[n-1])\Vert \\&\quad \quad +\Vert \nabla _{\textbf{x}^m}f^*_{i,n}(\bar{\textbf{x}}[n-1])-\nabla _{\textbf{x}^m}f^*_{i,n-1}(\bar{\textbf{x}}[n-1])\Vert \Bigr )\\&\quad \xrightarrow {n\rightarrow \infty }0. \end{aligned}$$

For the terms of the form \(\frac{L_n}{\tau _n}\Vert \textbf{x}_\perp [n]\Vert \) to converge to zero, refer to (19) and Assumptions (T1) and (T2).

For the second part of the claim,

$$\begin{aligned}&\sum _{n=1}^\infty \alpha [n]L^{\max }_n\left\| \tilde{\textbf{x}}^m_i[n]-\tilde{\textbf{x}}^{m,av}_i[n]\right\| \le c_6\sum _{n=1}^\infty \frac{\alpha [n](L^{\max }_n)^2}{\tau ^{\min }_n}\Vert \textbf{x}^{\mathcal {S}_i}_\perp [n]\Vert +c_1\sum _{n=1}^\infty \frac{\rho ^n\alpha [n]L^{\max }_n}{\tau ^{\min }_n}\\&\quad +c_4\sum _{n=1}^\infty \sum _{l=1}^{n-1}\frac{\rho ^{n-l}\alpha [n]L^{\max }_n}{\tau ^{\min }_n}\sum _{i\in {\mathcal {N}_m}}\Bigl (L^{\max }_l\Vert \textbf{x}^m_i[l]-\bar{\textbf{x}}^m[l]\Vert +L^{\max }_{l-1}\Vert \textbf{x}^m_i[l-1]-\bar{\textbf{x}}^m[l-1]\Vert \\&\quad +\Vert \nabla _{\textbf{x}^m}f^*_{i,l}(\textbf{x}_i[l-1])-\nabla _{\textbf{x}^m}f^*_{i,l-1}(\textbf{x}_i[l-1])\Vert \\&\quad +\Vert \nabla _{\textbf{x}^m}f^*_{i,l}(\bar{\textbf{x}}[l-1])-\nabla _{\textbf{x}^m}f^*_{i,l-1}(\bar{\textbf{x}}[l-1])\Vert \Bigr )\\&\quad +c_5\sum _{n=1}^\infty \frac{\alpha [n]L^{\max }_n}{\tau ^{\min }_n}\sum _{i\in {\mathcal {N}_m}}\Bigl (L^{\max }_n\Vert \textbf{x}^m_i[n]-\bar{\textbf{x}}^m[n]\Vert +L^{\max }_{n-1}\Vert \textbf{x}^m_i[n-1]-\bar{\textbf{x}}^m[n-1]\Vert \\&\quad +\Vert \nabla _{\textbf{x}^m}f^*_{i,n}(\textbf{x}_i[n-1])-\nabla _{\textbf{x}^m}f^*_{i,n-1}(\textbf{x}_i[n-1])\Vert \\&\quad +\Vert \nabla _{\textbf{x}^m}f^*_{i,n}(\bar{\textbf{x}}[n-1])-\nabla _{\textbf{x}^m}f^*_{i,n-1}(\bar{\textbf{x}}[n-1])\Vert \Bigr )\\&<\infty . \end{aligned}$$

For the first term, use (19), (T4), and (T5). The second term is finite due to (T4) with the additional \(\alpha [n]\). The terms in the fourth line are just like the first term. The terms in the fifth line converge by the conditions of Theorem 3. The terms in the second line are of the type \(\sum _n\frac{\alpha [n]L_n}{\tau _n}\sum _l\rho ^{n-l}L_l\Vert \textbf{x}_\perp [l]\Vert \); from (19) and (T4) one can show that \(\sum _n\frac{\alpha [n]L^2_n}{\tau _n}\Vert \textbf{x}_\perp [n]\Vert \) converges, and hence these terms converge by applying the second part of Lemma C.1. The terms in the third line converge because of (T6).

B.7 Proof of Theorem 3

Denote \(F^*_n=\sum _{i\in \mathcal {N}}f^*_{i,n}\). By the descent lemma,

$$\begin{aligned} \begin{aligned}&F^*_n(\bar{\textbf{x}}[n+1])\le F^*_n(\bar{\textbf{x}}[n])+\sum _m\Bigg [\frac{\alpha [n]}{I_m}\nabla _{\textbf{x}^m}F^*_n(\bar{\textbf{x}}[n])^\top \sum _{i\in \mathcal {N}_m} \bigg [\left( \hat{\textbf{x}}^m_{i,n}(\bar{\textbf{x}}[n])-\bar{\textbf{x}}^m_i[n]\right) \\&\quad \quad +\left( \tilde{\textbf{x}}^{m,av}_i[n]-\hat{\textbf{x}}^m_{i,n}(\bar{\textbf{x}}[n])\right) +\left( \tilde{\textbf{x}}^m_i[n]-\tilde{\textbf{x}}^{m,av}_i[n]\right) +\left( \textbf{x}^{m,inx}_i[n]-\tilde{\textbf{x}}^m_i[n]\right) \bigg ]\\&\quad \quad +\frac{L^{\max }_n}{2}\Vert \bar{\textbf{x}}^m[n+1]-\bar{\textbf{x}}^m[n]\Vert ^2\Bigg ]. \end{aligned} \end{aligned}$$
(31)

By the convexity of G (A2),

$$\begin{aligned} \begin{aligned} G(\bar{\textbf{x}}^c[n+1])&\le (1-\alpha [n])G(\bar{\textbf{x}}^c[n])+\alpha [n]G\left( \frac{1}{I}\sum _{i=1}^I\textbf{x}^{c,inx}_i[n]\right) \\&\quad \le (1-\alpha [n])G(\bar{\textbf{x}}^c[n])+\frac{\alpha [n]}{I}\sum _{i=1}^IG(\textbf{x}^{c,inx}_i[n]). \end{aligned} \end{aligned}$$
(32)

Then using Proposition B.1 (a), (32), and (A2),

$$\begin{aligned} \begin{aligned}&\sum _m\frac{\alpha [n]}{I_m}\nabla _{\textbf{x}^m}F^*_n(\bar{\textbf{x}}[n])^\top \sum _{i\in \mathcal {N}_m}\left( \hat{\textbf{x}}^m_{i,n}(\bar{\textbf{x}}[n])-\bar{\textbf{x}}^m_i[n]\right) \\&\quad \le -\tau ^{\min }_n\alpha [n]\diamondsuit [n]+G(\bar{\textbf{x}}^c[n])-G(\bar{\textbf{x}}^c[n+1])+\frac{L_G\alpha [n]}{I}\sum _{i=1}^I\left\| \textbf{x}^{c,inx}_i[n]-\hat{\textbf{x}}^c_{i,n}(\bar{\textbf{x}}[n])\right\| ,\\ \end{aligned} \end{aligned}$$
(33)

where \(\diamondsuit [n]\) stands for the expression \(\sum _m\sum _{i\in \mathcal {N}_m}\left\| \hat{\textbf{x}}^m_i(\bar{\textbf{x}}[n])-\bar{\textbf{x}}^m_i[n]\right\| ^2\). Combining (31), (33), and (N1) with the Cauchy–Schwarz inequality as well as the triangle inequality, we get

$$\begin{aligned} \begin{aligned}&F^*_n(\bar{\textbf{x}}[n+1]) \le F^*_n(\bar{\textbf{x}}[n])+G(\bar{\textbf{x}}^c[n])-G(\bar{\textbf{x}}^c[n+1])+\frac{\alpha [n]L_G}{I}\sum _{i=1}^I\left\| \textbf{x}^{c,inx}_i[n]-\hat{\textbf{x}}^c_{i,n}(\bar{\textbf{x}}^c[n])\right\| \\&\qquad +\sum _m\left[ \frac{\alpha [n]L^{\max }_n}{I_m}\sum _{i\in \mathcal {N}_m}\left( \left\| \tilde{\textbf{x}}^{m,av}_i[n]-\hat{\textbf{x}}^m_{i,n}(\bar{\textbf{x}}[n])\right\| +\left\| \tilde{\textbf{x}}^m_i[n]-\tilde{\textbf{x}}^{m,av}_i[n]\right\| +\epsilon ^m_i[n]\right) \right] \\&\qquad +\sum _m\frac{L^{\max }_n}{2}\Vert \bar{\textbf{x}}^m[n+1]-\bar{\textbf{x}}^m[n]\Vert ^2-\tau ^{\min }_n\alpha [n]\diamondsuit [n]. \end{aligned} \end{aligned}$$
(34)

From the triangle inequality, Proposition B.2 (a) and Fact 11 (e),

$$\begin{aligned} \begin{aligned} \left\| \textbf{x}^{c,inx}_i[n]-\hat{\textbf{x}}^c_{i,n}(\bar{\textbf{x}}^c[n])\right\|&\le \left\| \textbf{x}^{c,inx}_i[n]-\tilde{\textbf{x}}^c_i[n]\right\| +\left\| \tilde{\textbf{x}}^c_i[n]-\tilde{\textbf{x}}^{c,av}_i[n]\right\| \\&\hspace{70pt}+\left\| \tilde{\textbf{x}}^{c,av}_i[n]-\hat{\textbf{x}}^c_{i,n}(\bar{\textbf{x}}^c[n])\right\| ,\\ \Vert \bar{\textbf{x}}^m[n+1]-\bar{\textbf{x}}^m[n]\Vert ^2&\le \left( \frac{c^m\alpha [n]L^{\max }_n}{I_m\tau ^{\min }_n}\right) ^2\quad \forall \hspace{5.0pt}m. \end{aligned} \end{aligned}$$

Substitute these expressions back into (34) and rearrange the terms to get

$$\begin{aligned} \begin{aligned}&U^*_{n+1}(\bar{\textbf{x}}[n+1]) \le U^*_n(\bar{\textbf{x}}[n])-\tau ^{\min }_n\alpha [n]\diamondsuit [n]+c_1(L^{\max }_n)^3\left( \frac{\alpha [n]}{\tau ^{\min }_n}\right) ^2\\&\qquad +F^*_{n+1}(\bar{\textbf{x}}[n+1])-F^*_n(\bar{\textbf{x}}[n+1])\\&\qquad +c_2\sum _m\left[ \alpha [n]L^{\max }_n\sum _{i\in \mathcal {N}_m}\left( \left\| \tilde{\textbf{x}}^{m,av}_i[n]-\hat{\textbf{x}}^m_{i,n}(\bar{\textbf{x}}[n])\right\| +\left\| \tilde{\textbf{x}}^m_i[n]-\tilde{\textbf{x}}^{m,av}_i[n]\right\| +\epsilon ^m_i[n]\right) \right] . \end{aligned} \end{aligned}$$
(35)

We now exploit [19] Lemma 8 with \(Y[n]=U^*_n(\bar{\textbf{x}}[n])\), \(X[n]=\tau ^{\min }_n\alpha [n]\diamondsuit [n]\) and

$$\begin{aligned} \begin{aligned}&Z[n]=c_1(L^{\max }_n)^3\left( \frac{\alpha [n]}{\tau ^{\min }_n}\right) ^2+F^*_{n+1}(\bar{\textbf{x}}[n+1])-F^*_n(\bar{\textbf{x}}[n+1])\\&\qquad +c_2\sum _m\left[ \alpha [n]L^{\max }_n\sum _{i\in \mathcal {N}_m}\left( \left\| \tilde{\textbf{x}}^{m,av}_i[n]-\hat{\textbf{x}}^m_{i,n}(\bar{\textbf{x}}[n])\right\| +\left\| \tilde{\textbf{x}}^m_i[n]-\tilde{\textbf{x}}^{m,av}_i[n]\right\| +\epsilon ^m_i[n]\right) \right] . \end{aligned} \end{aligned}$$

Since \(U(\bar{\textbf{x}}[n])\) is coercive (A3), \(Y[n]\not \rightarrow -\infty \); on the other hand, from Proposition B.2 (c), (d), and the assumptions of the Theorem, \(\sum _{n=1}^\infty Z[n]<\infty \). Thus, by [19] Lemma 8, \(\{U^*_n(\bar{\textbf{x}}[n])\}\) converges to a finite value and \(\sum _{n=1}^\infty \tau ^{\min }_n\alpha [n]\diamondsuit [n]\) converges as well, which means \(\sum _{n=1}^\infty \tau ^{\min }_n\alpha [n]\left\| \hat{\textbf{x}}^m_{i,n}(\bar{\textbf{x}}[n])-\bar{\textbf{x}}^m[n]\right\| ^2<\infty \) for all \(i\in \mathcal {N}_m\) and \(m\in [M+1]\). This in turn implies \(\lim _{n\rightarrow \infty }\left\| \hat{\textbf{x}}^m_{i,n}(\bar{\textbf{x}}[n])-\bar{\textbf{x}}^m[n]\right\| =0\) for all \(i\in \mathcal {N}_m\) and \(m\in [M+1]\). At this point, the localization is no longer an issue, and we will use the generalized definition of \(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}[n])\in \mathcal {K}\) so that \(\lim _{n\rightarrow \infty }\left\| \hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}[n])-\bar{\textbf{x}}[n]\right\| =0\) for all \(i\in \mathcal {N}\).

Since \(\{\bar{\textbf{x}}[n]\}\) is bounded, which follows from the convergence of \(\{U^*_n(\bar{\textbf{x}}[n])\}\), it has a limit point \(\bar{\textbf{x}}^\infty \in \mathcal {K}\). We assume \(\bar{\textbf{x}}[n]\rightarrow \bar{\textbf{x}}^\infty \); if this is not the case, one can pass to a subsequence \(\bar{\textbf{x}}[n_k]\) indexed by k such that \(\bar{\textbf{x}}[n_k]\rightarrow \bar{\textbf{x}}^\infty \) as \(k\rightarrow \infty \). We consider a partition into three cases: (1) bounded gradient (\(\exists \hspace{5.0pt}B\text { s.t. }\Vert \nabla f_i(\textbf{x})\Vert <B\hspace{5.0pt}\forall \hspace{5.0pt}i,\textbf{x}\)), (2) unbounded gradient and interior point (\(\bar{\textbf{x}}^\infty \in int(\mathcal {K})\)), and (3) unbounded gradient and boundary point (\(\bar{\textbf{x}}^\infty \in bd(\mathcal {K})\)).

(1) bounded gradient: Recall the map defined in Proposition B.1

$$\begin{aligned} \hat{\textbf{x}}_{i,n}(\tilde{\textbf{x}})=\underset{\textbf{x}}{\arg \min }\hspace{5.0pt}\tilde{f}^*_{i,n}(\textbf{x};\tilde{\textbf{x}}) +\mathbf {\pi }_i(\tilde{\textbf{x}})^\top (\textbf{x}-\tilde{\textbf{x}})+G(\textbf{x})\triangleq \underset{\textbf{x}}{\arg \min }\hspace{5.0pt}\tilde{U}_{i,n}(\textbf{x};\tilde{\textbf{x}}). \end{aligned}$$

This map converges to the map

$$\begin{aligned} \hat{\textbf{x}}_{i}(\tilde{\textbf{x}})=\underset{\textbf{x}}{\arg \min }\hspace{5.0pt}\tilde{f}_i(\textbf{x};\tilde{\textbf{x}})+\mathbf {\pi }_i(\tilde{\textbf{x}})^\top (\textbf{x}-\tilde{\textbf{x}})+G(\textbf{x})\triangleq \underset{\textbf{x}}{\arg \min }\hspace{5.0pt}\tilde{U}_{i}(\textbf{x};\tilde{\textbf{x}}), \end{aligned}$$
(36)

which might be multi-valued since we do not require \(\tilde{f}_i\) to be strongly convex. The latter map is well defined everywhere only when the gradients are bounded; otherwise \(\mathbf {\pi }_i\) could be unbounded. Moreover, when \(\Vert \nabla f_i(\textbf{x})\Vert =\infty \) and \(\textbf{x}\in int(\mathcal {K})\), it is impossible for \(\tilde{f}_i\) to simultaneously satisfy \(\nabla \tilde{f}_i(\textbf{x};\textbf{x})=\nabla f_i(\textbf{x})\), be defined everywhere, and be convex. Thus, the analysis for this case does not carry over to the other two cases.

Now consider the two maps evaluated at \(\bar{\textbf{x}}[n]\) and \(\bar{\textbf{x}}^\infty \), respectively, \(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}[n])\), the minimizer of \(\tilde{U}_{i,n}(\bullet ;\bar{\textbf{x}}[n])\triangleq \psi _n\), and \(\hat{\textbf{x}}_{i}(\bar{\textbf{x}}^\infty )\), the set of minimizers of \(\tilde{U}_{i}(\bullet ;\bar{\textbf{x}}^\infty )\triangleq \psi \). We have the following two properties.

  • \(\{\psi _n\}\) is eventually level-bounded, i.e. \(\forall \hspace{5.0pt}\alpha \in \mathbb {R}\), \(\bigcup _{n\in N,N\in \mathcal {N}_\infty }lev_{\le \alpha }\psi _n\) is bounded. Refer to [25], p. 8, p. 109, and p. 123 for the definitions of the notations. This is ensured by Assumption F3, i.e. either \(\tilde{f}_i(\bullet ;\textbf{x})\) is coercive \(\forall \hspace{5.0pt}\textbf{x},i\) or \(G(\bullet )\) is coercive.

  • \(\psi _n\overset{e}{\rightarrow }\psi \), i.e., \(\psi _n\) epi-converges to \(\psi \); see [25], p. 241 for the definition. Since \(\{\tilde{U}_{i,n}\}\) and \(\tilde{U}_i\) are continuous and \(\lim _{n\rightarrow \infty }\tilde{U}_{i,n}=\tilde{U}_i\), [25] Theorem 7.2, p. 241 yields \(\psi _n\overset{e}{\rightarrow }\psi \).

By [25] Theorem 7.33, p. 266, with these two properties, we then have

$$\begin{aligned} \bar{\textbf{x}}^\infty =\lim _{n\rightarrow \infty }\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}[n])\in \underset{n\rightarrow \infty }{\lim \sup }(\arg \min \psi _n) \subset \arg \min \psi =\hat{\textbf{x}}_{i}(\bar{\textbf{x}}^\infty ). \end{aligned}$$

Proposition 5(b) of [19] states that any fixed point of \(\hat{\textbf{x}}_{i}\) is also a stationary point of the original optimization problem; this is proved in [10] Proposition 8(b). Matters change slightly here, as the minimizer defining \(\hat{\textbf{x}}_{i}\) may not be unique. However, that proof does not exploit any strong convexity property, so \(\bar{\textbf{x}}^\infty \) is still a stationary point.
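As a toy numerical illustration of the argmin-convergence mechanism behind [25] Theorem 7.33 (with a hypothetical one-dimensional family, not the actual surrogates \(\tilde{U}_{i,n}\)), take \(\psi _n(x)=x^4-x^2+x/n\), which epi-converges to \(\psi (x)=x^4-x^2\) with \(\arg \min \psi =\{\pm 1/\sqrt{2}\}\); the minimizers of \(\psi _n\) converge to \(-1/\sqrt{2}\), consistent with \(\lim \sup (\arg \min \psi _n)\subset \arg \min \psi \) even when the inclusion is strict.

```python
import numpy as np

# Toy illustration: psi_n(x) = x^4 - x^2 + x/n epi-converges to
# psi(x) = x^4 - x^2, whose minimizers are +-1/sqrt(2). The tilt x/n
# selects the left minimizer, so argmin(psi_n) -> -1/sqrt(2).
x = np.linspace(-1.5, 1.5, 300_001)
psi = x**4 - x**2

for n in [1, 10, 100, 1000]:
    psi_n = psi + x / n
    print(n, x[np.argmin(psi_n)])      # tends to -0.7071...

print("argmin of the limit:", -1 / np.sqrt(2), 1 / np.sqrt(2))
```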

(2) unbounded gradient and interior point: Effectively we want to show

$$\begin{aligned} \nabla F(\bar{\textbf{x}}^\infty )^\top (\textbf{z}-\bar{\textbf{x}}^\infty )+G(\textbf{z})-G(\bar{\textbf{x}}^\infty )\ge 0\quad \forall \hspace{5.0pt}\textbf{z}\in \mathcal {K}, \end{aligned}$$
(37)

but we can no longer argue via \(\hat{\textbf{x}}_i\), which need not be well defined here. For the remainder of this case we write \(\bar{\textbf{x}}_n\) instead of \(\bar{\textbf{x}}[n]\) for brevity. From the optimality condition of \(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)\), we have that for all \(\textbf{z}\in \mathcal {K}\),

$$\begin{aligned} 0&\le \left[ \nabla \tilde{f}^*_{i,n}(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n);\bar{\textbf{x}}_n)+\sum _{j\ne i}\nabla f^*_{j,n}(\bar{\textbf{x}}_n)\right] ^\top \left( \textbf{z}-\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)\right) +G(\textbf{z})-G\left( \hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)\right) \nonumber \\&=\left[ \nabla \tilde{f}^*_{i,n}(\bar{\textbf{x}}_n;\bar{\textbf{x}}_n)+\sum _{j\ne i}\nabla f^*_{j,n}(\bar{\textbf{x}}_n)\right] ^\top \left( \textbf{z}-\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)\right) +G(\textbf{z})-G\left( \hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)\right) \nonumber \\&\qquad +\left[ \nabla \tilde{f}^*_{i,n}(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n);\bar{\textbf{x}}_n)-\nabla \tilde{f}^*_{i,n}(\bar{\textbf{x}}_n;\bar{\textbf{x}}_n)\right] ^\top \left( \textbf{z}-\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)\right) , \end{aligned}$$
(38)

where \(\nabla \tilde{f}^*_{i,n}(\bar{\textbf{x}}_n;\bar{\textbf{x}}_n)+\sum _{j\ne i}\nabla f^*_{j,n}(\bar{\textbf{x}}_n)\) is exactly \(\nabla F^*_n(\bar{\textbf{x}}_n)\). The difference in the last bracket is bounded as follows:

$$\begin{aligned}&\left\| \nabla \tilde{f}^*_{i,n}(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n);\bar{\textbf{x}}_n)-\nabla \tilde{f}^*_{i,n}(\bar{\textbf{x}}_n;\bar{\textbf{x}}_n)\right\| \\&\quad \le \left\| \nabla \tilde{f}^*_{i,n}(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n);\bar{\textbf{x}}_n)-\nabla \tilde{f}^*_{i,n}(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n);\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n))\right\| \\&\qquad +\left\| \nabla \tilde{f}^*_{i,n}(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n);\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n))-\nabla \tilde{f}^*_{i,n}(\bar{\textbf{x}}_n;\bar{\textbf{x}}_n)\right\| \\&\quad =\left\| \nabla \tilde{f}^*_{i,n}(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n);\bar{\textbf{x}}_n)-\nabla \tilde{f}^*_{i,n}(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n);\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n))\right\| +\left\| \nabla f^*_{i,n}(\hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n))-\nabla f^*_{i,n}(\bar{\textbf{x}}_n)\right\| \\&\quad \le L_{i,n}\left\| \hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)-\bar{\textbf{x}}_n\right\| +L_{i,n}\left\| \hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)-\bar{\textbf{x}}_n\right\| \le 2L^{\max }_n\left\| \hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)-\bar{\textbf{x}}_n\right\| , \end{aligned}$$

where the second inequality follows from the Lipschitz continuity of \(\nabla \tilde{f}^*_{i,n}(\textbf{x};\bullet )\) and of \(\nabla f^*_{i,n}(\bullet )\). Since the condition assumes \(\sum _n(L^{\max }_n)^3\left( \frac{\alpha [n]}{\tau ^{\min }_n}\right) ^2<\infty \) and we have shown \(\sum _n\alpha [n]\tau ^{\min }_n\Vert \hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)-\bar{\textbf{x}}_n\Vert ^2<\infty \), it must be that \(\Vert \hat{\textbf{x}}_{i,n}(\bar{\textbf{x}}_n)-\bar{\textbf{x}}_n\Vert ^2=O\left( \alpha [n]\frac{(L^{\max }_n)^3}{(\tau ^{\min }_n)^3}\right) \). Hence, under the conditions \(\lim _{n\rightarrow \infty }\alpha [n]\frac{(L^{\max }_n)^5}{(\tau ^{\min }_n)^3}=0\) and \(\lim _{n\rightarrow \infty }\nabla F^*_n=\nabla F\), taking \(n\rightarrow \infty \) in (38) yields exactly (37). Note that \(\bar{\textbf{x}}^\infty \) must satisfy \(\nabla F(\bar{\textbf{x}}^\infty )<\infty \): otherwise, since \(\bar{\textbf{x}}^\infty \) is an interior point, there would exist a descent direction.
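For concreteness, one hypothetical parameter schedule satisfying both conditions above (the exponents are illustrative, not prescribed by the theorem) is

$$\begin{aligned} \alpha [n]=n^{-0.9},\quad L^{\max }_n=n^{0.1},\quad \tau ^{\min }_n\equiv 1:\qquad (L^{\max }_n)^3\left( \frac{\alpha [n]}{\tau ^{\min }_n}\right) ^2=n^{-1.5}\ \text {(summable)},\qquad \alpha [n]\frac{(L^{\max }_n)^5}{(\tau ^{\min }_n)^3}=n^{-0.4}\rightarrow 0. \end{aligned}$$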

(3) unbounded gradient and boundary point: We consider two subcases.

  • \(\nabla F(\bar{\textbf{x}}^\infty )<\infty \): the same argument as in case (2) shows that \(\bar{\textbf{x}}^\infty \) is a stationary point. If in addition \(\Vert \nabla f_i(\bar{\textbf{x}}^\infty )\Vert <B\) for all \(i\), the argument of case (1), confined to a small neighborhood of \(\bar{\textbf{x}}^\infty \), applies as well.

  • \(\nabla F(\bar{\textbf{x}}^\infty )=\infty \): the definition of a stationary point fails here, and we can only fall back on the definition of a local minimum. However, both NEXT and our algorithm can numerically converge to a point that is not a local minimum.

Appendix C: Technical Lemmas

This appendix collects the technical lemmas used in the proofs. The proof of Lemma C.1 can be found in [15].

Fact 11

For all \(m\in [M+1]\) we have the following.

  (a) \(J^m_\perp \hat{\textbf{W}}^m[n]=\hat{\textbf{W}}^m[n]-\left( \textbf{W}^m[n]\cdot \frac{1}{I_m}\textbf{1}_{\mathcal {N}_m}\textbf{1}_{\mathcal {N}_m}^\top \right) \otimes \textbf{I}_{d_m}=\left( \textbf{I}_{d_mI_m}-\frac{1}{I_m}\textbf{1}_{\mathcal {N}_m}\textbf{1}_{\mathcal {N}_m}^\top \otimes \textbf{I}_{d_m}\right) \hat{\textbf{W}}^m[n]\left( \textbf{I}_{d_mI_m}-\frac{1}{I_m}\textbf{1}_{\mathcal {N}_m}\textbf{1}_{\mathcal {N}_m}^\top \otimes \textbf{I}_{d_m}\right) =J^m_\perp \hat{\textbf{W}}^m[n]J^m_\perp \); a numerical sanity check of this identity is sketched after the list.

  (b) \(J^m_\perp \hat{\textbf{P}}^m[n,l]=\left( \textbf{P}^m[n,l]-\frac{1}{I_m}\textbf{1}_{\mathcal {N}_m}\textbf{1}_{\mathcal {N}_m}^\top \right) \otimes \textbf{I}_{d_m}\).

  (c) \(\bar{\textbf{q}}^m\triangleq \frac{1}{I_m}\sum _{i\in {\mathcal {N}_m}}q_i=\frac{\textbf{1}_{\mathcal {N}_m}^\top \otimes \textbf{I}_{d_m}}{I_m}\textbf{q}\), where \(\textbf{q}=[q_1^\top \hspace{5.0pt}\cdots \hspace{5.0pt}q_I^\top ]^\top \) and \(q_1,\dots ,q_I\) are arbitrary vectors in \(\mathbb {R}^{d_m}\).

  (d) \(\textbf{x}^m[n]=\hat{\textbf{W}}^m[n-1]\textbf{x}^m[n-1]+\alpha [n-1]\hat{\textbf{W}}^m[n-1]\varDelta \textbf{x}^{m,inx}[n-1]\), where \(\varDelta \textbf{x}^{m,inx}[n]=\left( \mathbb {I}\{i\in {\mathcal {N}_m}\}(\textbf{x}^{m,inx}_i[n]-\textbf{x}^m_i[n])\right) _{i\in \mathcal {N}}\).

  (e) \(\bar{\textbf{x}}^m[n]=\bar{\textbf{x}}^m[n-1]+\frac{\alpha [n-1]}{I_m}\left( \textbf{1}_{\mathcal {N}_m}^\top \otimes \textbf{I}_{d_m}\right) \varDelta \textbf{x}^{m,inx}[n-1]\).
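As a sanity check of identity (a), the following sketch verifies it numerically under the assumptions that \(\textbf{W}^m[n]\) is doubly stochastic and \(\hat{\textbf{W}}^m[n]=\textbf{W}^m[n]\otimes \textbf{I}_{d_m}\); the Birkhoff-style construction of the weights and the dimensions \(I_m=5\), \(d_m=3\) are illustrative choices, not part of the algorithm.

```python
import numpy as np

# Illustrative check of Fact 11(a): J_perp @ W_hat equals both the
# middle expression and J_perp @ W_hat @ J_perp, assuming W is
# doubly stochastic and W_hat = kron(W, I_d).
rng = np.random.default_rng(0)
I_m, d_m = 5, 3

# Doubly stochastic W: convex combination of permutation matrices.
perms = [np.eye(I_m)[rng.permutation(I_m)] for _ in range(4)]
w = rng.random(4)
w /= w.sum()
W = sum(wi * P for wi, P in zip(w, perms))

avg = np.ones((I_m, I_m)) / I_m                    # (1/I_m) 1 1^T
J_perp = np.kron(np.eye(I_m) - avg, np.eye(d_m))   # consensus-complement projector
W_hat = np.kron(W, np.eye(d_m))

lhs = J_perp @ W_hat
mid = W_hat - np.kron(W @ avg, np.eye(d_m))        # middle expression in (a)
rhs = J_perp @ W_hat @ J_perp
assert np.allclose(lhs, mid) and np.allclose(lhs, rhs)
```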

Lemma C.1

Let \(0<\lambda <1\), and let \(\{\beta [n]\}\) and \(\{\nu [n]\}\) be two positive scalar sequences such that \(\beta [n]\rightarrow 0\), \(\nu [n]\rightarrow \infty \), and \(\beta [n]\nu [n]\rightarrow 0\). If further there exist \(1>\tilde{\lambda }>\lambda \) and N such that \(\frac{\beta [n]}{\beta [l]}\ge \tilde{\lambda }^{n-l}\) for all \(n\ge l\ge N\), then \(\lim _{n\rightarrow \infty }\nu [n]\sum _{l=1}^n\lambda ^{n-l}\beta [l]=0\). Moreover, if \(\beta [n]\nu [n]\) is summable, then so is \(\nu [n]\sum _{l=1}^n\lambda ^{n-l}\beta [l]\).
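As a quick numerical illustration of Lemma C.1, take the hypothetical sequences \(\lambda =0.5\), \(\beta [n]=1/n\), and \(\nu [n]=\sqrt{n}\), which satisfy the hypotheses (with, e.g., \(\tilde{\lambda }=0.9\)); the sketch below evaluates \(\nu [n]\sum _{l=1}^n\lambda ^{n-l}\beta [l]\) and shows it decaying toward zero.

```python
import numpy as np

# Hypothetical instance of Lemma C.1: lambda = 0.5, beta[n] = 1/n,
# nu[n] = sqrt(n); then nu[n] * sum_{l<=n} lambda^(n-l) beta[l] -> 0.
lam = 0.5
N = 100_000
n = np.arange(1, N + 1)
beta = 1.0 / n
nu = np.sqrt(n)

# conv[k] holds sum_{l<=k+1} lam^(k+1-l) beta[l], computed recursively.
conv = np.empty(N)
acc = 0.0
for k in range(N):
    acc = lam * acc + beta[k]
    conv[k] = acc

vals = nu * conv
print(vals[[99, 999, 9_999, 99_999]])  # shrinking toward 0
```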
