Abstract
This paper analyzes block-coordinate proximal gradient methods for minimizing the sum of a separable smooth function and a (nonseparable) nonsmooth function, both of which are allowed to be nonconvex. The main tool in our analysis is the forward-backward envelope, which serves as a particularly suitable continuous and real-valued Lyapunov function. Global and linear convergence results are established when the cost function satisfies the Kurdyka–Łojasiewicz property without imposing convexity requirements on the smooth function. Two prominent special cases of the investigated setting are regularized finite sum minimization and the sharing problem; in particular, an immediate byproduct of our analysis leads to novel convergence results and rates for the popular Finito/MISO algorithm in the nonsmooth and nonconvex setting with very general sampling strategies.
References
Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116(1–2), 5–16 (2009)
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)
Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137(1), 91–129 (2013)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, Berlin (2017)
Beck, A.: First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA (2017)
Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)
Bertsekas, D.P.: Incremental proximal methods for large scale convex optimization. Math. Program. 129(2), 163–195 (2011)
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont, MA (2016)
Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods, vol. 23. Prentice-Hall, Upper Saddle River (1989)
Bianchi, P., Hachem, W., Iutzeler, F.: A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Trans. Autom. Control 61(10), 2947–2957 (2016)
Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)
Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Chouzenoux, E., Pesquet, J.C., Repetti, A.: A block coordinate variable metric forward-backward algorithm. J. Glob. Optim. 66(3), 457–485 (2016)
Chow, Y.T., Wu, T., Yin, W.: Cyclic coordinate-update algorithms for fixed-point problems: analysis and applications. SIAM J. Sci. Comput. 39(4), A1280–A1300 (2017)
Clarke, F.H.: Optimization and Nonsmooth Analysis. Society for Industrial and Applied Mathematics, Philadelphia, PA (1990)
Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)
Davis, D.: Smart: The stochastic monotone aggregated root-finding algorithm. arXiv:1601.00698 (2016)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)
Defazio, A., Domke, J.: Finito: A faster, permutable incremental gradient method for big data problems. In: International Conference on Machine Learning, pp. 1125–1133 (2014)
Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems, vol. II. Springer, Berlin (2003)
Fercoq, O., Bianchi, P.: A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions. SIAM J. Optim. 29(1), 100–134 (2019)
Frankel, P., Garrigos, G., Peypouquet, J.: Splitting methods with variable metric for Kurdyka–Łojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165(3), 874–900 (2015)
Fukushima, M., Mine, H.: A generalized proximal point algorithm for certain non-convex minimization problems. Int. J. Syst. Sci. 12(8), 989–1000 (1981)
Hanzely, F., Mishchenko, K., Richtárik, P.: SEGA: Variance reduction via gradient sketching. In: Advances in Neural Information Processing Systems, pp. 2082–2093 (2018)
Hong, M., Wang, X., Razaviyayn, M., Luo, Z.Q.: Iteration complexity analysis of block coordinate descent methods. Math. Program. 163(1–2), 85–114 (2017)
Hou, Y., Song, I., Min, H.K., Park, C.H.: Complexity-reduced scheme for feature extraction with linear discriminant analysis. IEEE Trans. Neural Netw. Learn. Syst. 23(6), 1003–1009 (2012)
Iutzeler, F., Bianchi, P., Ciblat, P., Hachem, W.: Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In: 52nd IEEE Conference on Decision and Control (CDC), pp. 3671–3676 (2013)
Kurdyka, K.: On gradients of functions definable in \(o\)-minimal structures. Annales de l’institut Fourier 48(3), 769–783 (1998)
Latafat, P., Freris, N.M., Patrinos, P.: A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization. IEEE Trans. Autom. Control 64(10), 4050–4065 (2019)
Li, G., Pong, T.K.: Douglas–Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems. Math. Program. 159(1), 371–401 (2016)
Lin, Q., Lu, Z., Xiao, L.: An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM J. Optim. 25(4), 2244–2273 (2015)
Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles pp. 87–89 (1963)
Łojasiewicz, S.: Sur la géométrie semi- et sous- analytique. Annales de l’institut Fourier 43(5), 1575–1595 (1993)
Mairal, J.: Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. 25(2), 829–855 (2015)
Mokhtari, A., Gürbüzbalaban, M., Ribeiro, A.: Surpassing gradient descent provably: a cyclic incremental method with linear convergence rate. SIAM J. Optim. 28(2), 1420–1447 (2018)
Necoara, I.: Random coordinate descent algorithms for multi-agent convex optimization over networks. IEEE Trans. Autom. Control 58(8), 2001–2012 (2013)
Necoara, I., Patrascu, A.: A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Comput. Optim. Appl. 57(2), 307–337 (2014)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, New York (2013)
Ochs, P., Chen, Y., Brox, T., Pock, T.: iPiano: inertial proximal algorithm for nonconvex optimization. SIAM J. Imaging Sci. 7(2), 1388–1419 (2014)
Patrinos, P., Bemporad, A.: Proximal Newton methods for convex composite optimization. In: 52nd IEEE Conference on Decision and Control, pp. 2358–2363 (2013)
Peng, Z., Xu, Y., Yan, M., Yin, W.: ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM J. Sci. Comput. 38(5), A2851–A2879 (2016)
Pesquet, J.C., Repetti, A.: A class of randomized primal-dual algorithms for distributed optimization. J. Nonlinear Convex Anal. 16(12), 2453–2490 (2015)
Qian, X., Sailanbayev, A., Mishchenko, K., Richtárik, P.: MISO is making a comeback with better proofs and rates. arXiv:1906.01474 (2019)
Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.J.: Stochastic variance reduction for nonconvex optimization. In: International Conference on Machine Learning, pp. 314–323 (2016)
Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: Advances in Neural Information Processing Systems, pp. 1145–1153 (2016)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)
Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Herbert Robbins Selected Papers, pp. 111–135. Springer (1985)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer, Berlin (2011)
Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1), 83–112 (2017)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14(Feb), 567–599 (2013)
Themelis, A.: Proximal algorithms for structured nonconvex optimization. Ph.D. thesis, KU Leuven (2018)
Themelis, A., Ahookhosh, M., Patrinos, P.: On the acceleration of forward-backward splitting via an inexact Newton method. In: Bauschke, H.H., Burachik, R.S., Luke, D.R. (eds.) Splitting Algorithms, Modern Operator Theory, and Applications, pp. 363–412. Springer, Cham (2019)
Themelis, A., Stella, L., Patrinos, P.: Forward-backward envelope for the sum of two nonconvex functions: further properties and nonmonotone linesearch algorithms. SIAM J. Optim. 28(3), 2274–2303 (2018)
Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)
Tseng, P., Bertsekas, D.P.: Relaxation methods for problems with strictly convex separable costs and linear constraints. Math. Program. 38(3), 303–321 (1987)
Tseng, P., Yun, S.: Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. J. Optim. Theory Appl. 140(3), 513–535 (2008)
Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117(1), 387–423 (2009)
Tseng, P., Yun, S.: A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training. Comput. Optim. Appl. 47(2), 179–206 (2010)
Xu, Y., Yin, W.: A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci. 6(3), 1758–1789 (2013)
Xu, Y., Yin, W.: A globally convergent algorithm for nonconvex optimization based on block coordinate update. J. Sci. Comput. 72(2), 700–734 (2017)
Yu, P., Li, G., Pong, T.K.: Deducing Kurdyka-Łojasiewicz exponent via inf-projection. arXiv:1902.03635 (2019)
This work was supported by the Research Foundation Flanders (FWO) PhD grant 1196820N and research projects G0A0920N, G086518N and G086318N; Research Council KU Leuven C1 Project No. C14/18/068; Fonds de la Recherche Scientifique—FNRS and the Fonds Wetenschappelijk Onderzoek—Vlaanderen under EOS Project No. 30468160 (SeLMA).
A The key tool: the forward-backward envelope
This appendix contains some proofs and auxiliary results omitted in the main body. We begin by observing that, since \(F\) and \(-F\) are 1-smooth in the metric induced by \( \varLambda _F{:}{=}\tfrac{1}{N}{{\,\mathrm{blkdiag}\,}}(L_{f_1}\mathrm{I}_{n_1},\dots ,L_{f_N}\mathrm{I}_{n_N}) \), one has
$$\begin{aligned} \bigl | F({\varvec{w}})-F({\varvec{x}})-\langle {}\nabla F({\varvec{x}}){},{}{\varvec{w}}-{\varvec{x}}{}\rangle \bigr | \le \tfrac{1}{2}\Vert {\varvec{w}}-{\varvec{x}}\Vert _{\varLambda _F}^2 \end{aligned}$$(A.1)
for all \({\varvec{x}},{\varvec{w}}\in \mathbb {R}^{\sum _in_i}\), see [8, Prop. A.24]. Let us denote by \({\mathcal {M}}_{\varGamma }\) the quantity being minimized (with respect to \({\varvec{w}}\)) in the definition (2.2a) of the FBE, namely
$$\begin{aligned} {\mathcal {M}}_{\varGamma }({\varvec{w}},{\varvec{x}}) {:}{=}F({\varvec{x}}) + \langle {}\nabla F({\varvec{x}}){},{}{\varvec{w}}-{\varvec{x}}{}\rangle + G({\varvec{w}}) + \tfrac{1}{2}\Vert {\varvec{w}}-{\varvec{x}}\Vert _{\varGamma ^{-1}}^2. \end{aligned}$$
It follows from (A.1) that
$$\begin{aligned} \varPhi ({\varvec{w}}) + \tfrac{1}{2}\Vert {\varvec{w}}-{\varvec{x}}\Vert _{\varGamma ^{-1}-\varLambda _F}^2 \le {\mathcal {M}}_{\varGamma }({\varvec{w}},{\varvec{x}}) \le \varPhi ({\varvec{w}}) + \tfrac{1}{2}\Vert {\varvec{w}}-{\varvec{x}}\Vert _{\varGamma ^{-1}+\varLambda _F}^2 \end{aligned}$$(A.2)
holds for all \({\varvec{x}},{\varvec{w}}\in \mathbb {R}^{\sum _in_i}\). In particular, \({\mathcal {M}}_{\varGamma }\) is a majorizing model for \(\varPhi \), in the sense that \({\mathcal {M}}_{\varGamma }({\varvec{x}},{\varvec{x}})=\varPhi ({\varvec{x}})\) and \({\mathcal {M}}_{\varGamma }({\varvec{w}},{\varvec{x}})\ge \varPhi ({\varvec{w}})\) for all \({\varvec{x}},{\varvec{w}}\in \mathbb {R}^{\sum _in_i}\). In fact, as explained in Sect. 2.1, while a \(\varGamma \)-forward-backward step \({\varvec{z}}\in {\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}})\) amounts to evaluating a minimizer of \({\mathcal {M}}_{\varGamma }(\cdot ,{\varvec{x}})\), the FBE is defined instead as the minimization value, namely \(\varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}})={\mathcal {M}}_{\varGamma }({\varvec{z}},{\varvec{x}})\) where \({\varvec{z}}\) is any element of \({\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}})\).
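For intuition, the following short numerical sketch (illustrative only: the problem data, the choice \(G=\lambda \Vert \cdot \Vert _1\), and all helper names are assumptions made for the example, not taken from the paper) builds \({\text {T}}_\varGamma ^{\textsc {fb}}\) and \(\varPhi _\varGamma ^{\textsc {fb}}\) for quadratic blocks \(f_i\) and checks the sandwich \(\varPhi ({\varvec{z}})\le \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}})\le \varPhi ({\varvec{x}})\) implied by (A.2) (cf. Lemma 2.3).

```python
import numpy as np

# Illustrative instance: F(x) = (1/N) * sum_i f_i(x_i) with quadratic blocks
# f_i(x_i) = 0.5*x_i'Q_i x_i - b_i'x_i, and G(x) = lam*||x||_1 (prox is
# soft-thresholding). All data below is made up for the sketch.
rng = np.random.default_rng(0)
N, n, lam = 3, 2, 0.1
Q = [np.diag(rng.uniform(0.5, 2.0, n)) for _ in range(N)]
b = [rng.standard_normal(n) for _ in range(N)]
L = [np.linalg.norm(Qi, 2) for Qi in Q]          # Lipschitz moduli L_{f_i}
gamma = [0.9 * N / Li for Li in L]               # step sizes gamma_i < N/L_{f_i}

def F(x):
    return sum(0.5 * xi @ Qi @ xi - bi @ xi for xi, Qi, bi in zip(x, Q, b)) / N

def gradF(x):
    return [(Qi @ xi - bi) / N for xi, Qi, bi in zip(x, Q, b)]

def G(x):
    return lam * sum(np.abs(xi).sum() for xi in x)

def T_fb(x):
    # forward step on each block, then prox of G in the metric Gamma^{-1}
    out = []
    for xi, gfi, gi in zip(x, gradF(x), gamma):
        v = xi - gi * gfi
        out.append(np.sign(v) * np.maximum(np.abs(v) - lam * gi, 0.0))
    return out

def fbe(x):
    # FBE = value of the model M_Gamma(., x) at the forward-backward step z
    z, gf = T_fb(x), gradF(x)
    lin = sum(gfi @ (zi - xi) for gfi, zi, xi in zip(gf, z, x))
    quad = sum((zi - xi) @ (zi - xi) / (2 * gi) for zi, xi, gi in zip(z, x, gamma))
    return F(x) + lin + G(z) + quad

Phi = lambda x: F(x) + G(x)
x = [rng.standard_normal(n) for _ in range(N)]
z = T_fb(x)
print(Phi(z) <= fbe(x) <= Phi(x))                # sandwich of Lemma 2.3: True
```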
A.1 Proofs of Sect. 2.1
Proof of Lemma 2.1
For \({\varvec{x}}^\star \in {{\,\mathrm{\mathrm{arg\,min}}\,}}\varPhi \) it follows from (A.1) that
$$\begin{aligned} G({\varvec{w}}) = \varPhi ({\varvec{w}})-F({\varvec{w}}) \ge \varPhi ({\varvec{x}}^\star )-F({\varvec{x}}^\star )-\langle {}\nabla F({\varvec{x}}^\star ){},{}{\varvec{w}}-{\varvec{x}}^\star {}\rangle -\tfrac{1}{2}\Vert {\varvec{w}}-{\varvec{x}}^\star \Vert _{\varLambda _F}^2 \quad \forall {\varvec{w}}\in \mathbb {R}^{\sum _in_i}. \end{aligned}$$
Therefore, \(G\) is lower bounded by a quadratic function with quadratic term \(-\tfrac{1}{2}\Vert \cdot \Vert _{\varLambda _F}^2\), and thus is prox-bounded in the sense of [51, Def. 1.23]. The claim then follows from [51, Th. 1.25 and Ex. 5.23(b)] and the continuity of the forward mapping \(\hbox {id} - \varGamma \nabla F\). \(\square \)
Proof of Lemma 2.3
(FBE: fundamental inequalities). Local Lipschitz continuity follows from (2.2d) in light of Lemma 2.1 and [51, Ex. 10.32].
-
2.3(i) Follows by taking \({\varvec{w}}={\varvec{x}}\) in (2.2a).
-
2.3(ii) Directly follows from (A.2) and the identity \(\varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}})={\mathcal {M}}_{\varGamma }({\varvec{z}},{\varvec{x}})\) for \({\varvec{z}}\in {\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}})\). \(\square \)
Proof of Lemma 2.4
(FBE: minimization equivalence).
-
2.4(i) and 2.4(ii) It follows from Lemma 2.3(i) that \(\inf \varPhi _\varGamma ^{\textsc {fb}}\le \min \varPhi \). Conversely, let \(({\varvec{x}}^k)_{{k\in \mathbb {N}}}\) be such that \(\varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}}^k)\rightarrow \inf \varPhi _\varGamma ^{\textsc {fb}}\) as \(k\rightarrow \infty \), and for each \(k\) let \({\varvec{z}}^k\in {\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}}^k)\). It then follows from Lemmas 2.3(i) and 2.3(ii) that
$$\begin{aligned} \inf \varPhi _\varGamma ^{\textsc {fb}}\le \min \varPhi \le \liminf _{k\rightarrow \infty }\varPhi ({\varvec{z}}^k) \le \liminf _{k\rightarrow \infty }\varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}}^k) = \inf \varPhi _\varGamma ^{\textsc {fb}}, \end{aligned}$$hence \(\min \varPhi =\inf \varPhi _\varGamma ^{\textsc {fb}}\). Suppose now that \({\varvec{x}}\in {{\,\mathrm{\mathrm{arg\,min}}\,}}\varPhi \) (which exists by Assumption I); then it follows from Lemma 2.3(ii) that \({\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}})=\left\{ {\varvec{x}} \right\} \) (for otherwise another element would belong to a lower level set of \(\varPhi \)). Combining with Lemma 2.3(i) with \({\varvec{z}}={\varvec{x}}\) we then have
$$\begin{aligned} \min \varPhi = \varPhi ({\varvec{z}}) \le \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}}) \le \varPhi ({\varvec{x}}) = \min \varPhi . \end{aligned}$$Since \(\min \varPhi =\inf \varPhi _\varGamma ^{\textsc {fb}}\), we conclude that \({\varvec{x}}\in {{\,\mathrm{\mathrm{arg\,min}}\,}}\varPhi _\varGamma ^{\textsc {fb}}\), and that in particular \(\inf \varPhi _\varGamma ^{\textsc {fb}}=\min \varPhi _\varGamma ^{\textsc {fb}}\). Conversely, suppose \({\varvec{x}}\in {{\,\mathrm{\mathrm{arg\,min}}\,}}\varPhi _\varGamma ^{\textsc {fb}}\) and let \({\varvec{z}}\in {\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}})\). By combining Lemmas 2.3(i) and 2.3(ii) we have that \({\varvec{z}}={\varvec{x}}\), that is, that \({\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}})=\left\{ {\varvec{x}} \right\} \). It then follows from Lemma 2.3(ii) and assertion 2.4(i) that
$$\begin{aligned} \varPhi ({\varvec{x}}) = \varPhi ({\varvec{z}}) \le \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}}) = \min \varPhi _\varGamma ^{\textsc {fb}}= \min \varPhi , \end{aligned}$$hence \({\varvec{x}}\in {{\,\mathrm{\mathrm{arg\,min}}\,}}\varPhi \).
-
2.4(iii) Due to Lemma 2.3(i), if \(\varPhi _\varGamma ^{\textsc {fb}}\) is level bounded clearly so is \(\varPhi \). Conversely, suppose that \(\varPhi _\varGamma ^{\textsc {fb}}\) is not level bounded. Then, there exist \(\alpha \in \mathbb {R}\) and \(({\varvec{x}}^k)_{{k\in \mathbb {N}}}\subseteq {{\,\mathrm{lev}\,}}_{\le \alpha }\varPhi _\varGamma ^{\textsc {fb}}\) such that \(\Vert {\varvec{x}}^k\Vert \rightarrow \infty \) as \(k\rightarrow \infty \). Let \(\lambda =\min _i\left\{ \gamma _i^{-1}-L_{f_i}N^{-1} \right\} >0\), and for each \(k\in \mathbb {N}\) let \({\varvec{z}}^k\in {\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}}^k)\). It then follows from Lemma 2.3(ii) that
$$\begin{aligned} \min \varPhi \le \varPhi ({\varvec{z}}^k) \le \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}}^k) - \tfrac{\lambda }{2}\Vert {\varvec{x}}^k-{\varvec{z}}^k\Vert ^2 \le \alpha - \tfrac{\lambda }{2}\Vert {\varvec{x}}^k-{\varvec{z}}^k\Vert ^2, \end{aligned}$$hence \(({\varvec{z}}^k)_{{k\in \mathbb {N}}}\subseteq {{\,\mathrm{lev}\,}}_{\le \alpha }\varPhi \) and \( \Vert {\varvec{x}}^k-{\varvec{z}}^k\Vert ^2 \le \tfrac{2}{\lambda }(\alpha -\min \varPhi ) \). Consequently, also the sequence \(({\varvec{z}}^k)_{{k\in \mathbb {N}}}\subseteq {{\,\mathrm{lev}\,}}_{\le \alpha }\varPhi \) is unbounded, proving that \(\varPhi \) is not level bounded. \(\square \)
A.2 Further results
This section contains a list of auxiliary results invoked in the main proofs of Sect. 2.
Lemma A.1
Suppose that Assumption I holds, and let \(({\varvec{u}}^k)_{{k\in \mathbb {N}}}\) and \(({\varvec{v}}^k)_{{k\in \mathbb {N}}}\) be sequences with \({\varvec{v}}^k\in {\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{u}}^k)\) for all \(k\), both converging to a point \({\varvec{u}}^\star \) as \(k\rightarrow \infty \). Then, \({\varvec{u}}^\star \in {\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{u}}^\star )\), and in particular \(0\in \hat{\partial }\varPhi ({\varvec{u}}^\star )\).
Proof
Since \(\nabla F\) is continuous, it holds that \({{\varvec{u}}^k-\varGamma \nabla F({\varvec{u}}^k)}\rightarrow {{\varvec{u}}^\star -\varGamma \nabla F({\varvec{u}}^\star )}\) as \(k\rightarrow \infty \). From outer semicontinuity of \({{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}}\) [51, Ex. 5.23(b)] it then follows that
$$\begin{aligned} {\varvec{u}}^\star = \lim _{k\rightarrow \infty }{\varvec{v}}^k \in \limsup _{k\rightarrow \infty }{{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}}\bigl ({\varvec{u}}^k-\varGamma \nabla F({\varvec{u}}^k)\bigr ) \subseteq {{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}}\bigl ({\varvec{u}}^\star -\varGamma \nabla F({\varvec{u}}^\star )\bigr ) = {\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{u}}^\star ), \end{aligned}$$
where the limit superior is meant in the Painlevé–Kuratowski sense, cf. [51, Def. 4.1]. The optimality conditions defining \({{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}}\) [51, Th. 10.1] then read
$$\begin{aligned} 0 \in \hat{\partial }\Bigl ( G+\tfrac{1}{2}\Vert {}\cdot {}-{\varvec{u}}^\star +\varGamma \nabla F({\varvec{u}}^\star )\Vert _{\varGamma ^{-1}}^2 \Bigr )({\varvec{u}}^\star ) = \hat{\partial }G({\varvec{u}}^\star )+\nabla F({\varvec{u}}^\star ) = \hat{\partial }\varPhi ({\varvec{u}}^\star ), \end{aligned}$$
where the first and last equalities follow from [51, Ex. 8.8(c)]. \(\square \)
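As an illustration (a self-contained sketch with made-up data: the quadratic \(f_i\), the choice \(G=\lambda \Vert \cdot \Vert _1\), and all helper names are assumptions for the example), iterating \({\varvec{x}}^{k+1}\in {\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}}^k)\) produces sequences \({\varvec{u}}^k={\varvec{x}}^k\) and \({\varvec{v}}^k={\varvec{x}}^{k+1}\) as in Lemma A.1, and the limit is numerically a fixed point of \({\text {T}}_\varGamma ^{\textsc {fb}}\):

```python
import numpy as np

# Made-up convex toy instance: F(x) = (1/N)*sum_i (0.5*x_i'Q_i x_i - b_i'x_i),
# G = lam*||.||_1, so the forward-backward map is a gradient step + soft-threshold.
rng = np.random.default_rng(2)
N, n, lam = 2, 3, 0.2
Q = [np.diag(rng.uniform(1.0, 3.0, n)) for _ in range(N)]
b = [rng.standard_normal(n) for _ in range(N)]
gamma = [0.9 * N / np.max(np.diag(Qi)) for Qi in Q]   # gamma_i < N/L_{f_i}

def T_fb(x):
    out = []
    for xi, Qi, bi, gi in zip(x, Q, b, gamma):
        v = xi - gi * (Qi @ xi - bi) / N              # forward step on f_i/N
        out.append(np.sign(v) * np.maximum(np.abs(v) - lam * gi, 0.0))
    return out

x = [rng.standard_normal(n) for _ in range(N)]
for _ in range(500):                                  # u^k = x^k, v^k = x^{k+1}
    x = T_fb(x)

residual = max(np.max(np.abs(zi - xi)) for zi, xi in zip(T_fb(x), x))
print(residual < 1e-10)                               # x is (numerically) a fixed point
```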
Lemma A.2
Suppose that Assumption I holds and that \(G\) is convex. Then, the following hold:
- (i):
-
\({{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}}\) is (single-valued and) firmly nonexpansive (FNE) in the metric \(\Vert \cdot \Vert _{\varGamma ^{-1}}\); namely,
$$\begin{aligned} \Vert {{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}}({\varvec{u}}) - {{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}}({\varvec{v}}) \Vert _{\varGamma ^{-1}}^2 \le \langle {} {{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}}({\varvec{u}}) -{} {{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}}({\varvec{v}}) {},{} \varGamma ^{-1}({\varvec{u}}-{\varvec{v}}) {}\rangle \le \Vert {\varvec{u}} - {\varvec{v}} \Vert _{\varGamma ^{-1}}^2 \quad \forall {\varvec{u}},{\varvec{v}}; \end{aligned}$$
- (ii):
-
the Moreau envelope \(G^{\varGamma ^{-1}}\) is differentiable with \(\nabla G^{\varGamma ^{-1}}=\varGamma ^{-1}(\mathrm{id}-{{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}})\);
- (iii):
-
for every \({\varvec{x}}\in \mathbb {R}^{\sum _in_i}\) it holds that \( {{\,\mathrm{dist}\,}}(0,\partial \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}})) \le \tfrac{ N+\max _i\left\{ \gamma _iL_{f_i} \right\} }{ N\min _i\left\{ \sqrt{\gamma _i} \right\} } \Vert {\varvec{x}}-{\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}})\Vert _{\varGamma ^{-1}} \);
- (iv):
-
\({\text {T}}_\varGamma ^{\textsc {fb}}\) is \(L_\mathbf{T}\)-Lipschitz continuous in the metric \(\Vert \cdot \Vert _{\varGamma ^{-1}}\) for some \(L_\mathbf{T}\ge 0\);
If in addition \(f_i\) is \(\mu _{f_i}\)-strongly convex, \(i\in [N]\), then the following hold:
- (v):
-
In A.2(iv), \(L_\mathbf{T}\le 1-\delta \) for \(\delta =\frac{1}{N}\min _{i\in [N]}\left\{ \gamma _i\mu _{f_i} \right\} \);
- (vi):
-
For every \({\varvec{x}}\in \mathbb {R}^{\sum _in_i}\)
$$\begin{aligned} \tfrac{1}{2}\Vert {\varvec{z}}-{\varvec{x}}^\star \Vert _{\mu _F}^2 \le \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}})-\min \varPhi \le \tfrac{1}{2}\Vert {\varvec{z}}-{\varvec{x}}\Vert _{\varGamma ^{-2}\mu _F^{-1}(\mathrm{I}-\varGamma \mu _F)}^2 \end{aligned}$$where \({\varvec{x}}^\star {:}{=}{{\,\mathrm{\mathrm{arg\,min}}\,}}\varPhi \), \( \mu _F {:}{=}\frac{1}{N}{{\,\mathrm{blkdiag}\,}}\bigl (\mu _{f_1}\mathrm{I}_{n_1},\dots ,\mu _{f_N}\mathrm{I}_{n_N}\bigr ) \), and \({\varvec{z}}={\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}})\).
Proof
-
A.2(iii) Let \(D\subseteq \mathbb {R}^{\sum _in_i}\) be the set of points at which \(\nabla F\) is differentiable. From the chain rule of differentiation applied to the expression (2.2d) and using assertion A.2(ii), we have that \(\varPhi _\varGamma ^{\textsc {fb}}\) is differentiable on \(D\) with gradient
$$\begin{aligned} \nabla \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}}) = \bigl [ \mathrm{I}-\varGamma \nabla ^2F({\varvec{x}}) \bigr ] \varGamma ^{-1} \bigl [ {\varvec{x}}-{\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}}) \bigr ] \quad \forall {\varvec{x}}\in D. \end{aligned}$$Since \(D\) is dense in \(\mathbb {R}^{\sum _in_i}\) owing to Lipschitz continuity of \(\nabla F\), we may invoke [51, Th. 9.61] to infer that \(\partial \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}})\) is nonempty for every \({\varvec{x}}\in \mathbb {R}^{\sum _in_i}\) and
$$\begin{aligned} \partial \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}}) \supseteq \partial _B\varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}})= & {} \bigl [ \mathrm{I}-\varGamma \partial _B\nabla F({\varvec{x}}) \bigr ] \varGamma ^{-1} \bigl [ {\varvec{x}}-{\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}}) \bigr ] \\= & {} \bigl [ \varGamma ^{-1}-\partial _B\nabla F({\varvec{x}}) \bigr ] \bigl [ {\varvec{x}}-{\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}}) \bigr ], \end{aligned}$$where \(\partial _B\) denotes the (set-valued) Bouligand differential [22, §7.1]. The claim now follows by observing that \( \partial _B\nabla F({\varvec{x}}) = \tfrac{1}{N}{{\,\mathrm{blkdiag}\,}}(\partial _B\nabla f_1(x_1),\dots ,\partial _B\nabla f_N(x_N)) \) and that each element of \(\partial _B\nabla f_i(x_i)\) has norm bounded by \(L_{f_i}\).
-
A.2(iv) Lipschitz continuity follows from assertion A.2(i) together with the fact that Lipschitz continuity is preserved by composition.
-
A.2(v) By [41, Thm 2.1.12] for all \(x_i,y_i\in \mathbb {R}^{n_i}\)
$$\begin{aligned}&\langle \nabla f_i(x_i)-\nabla f_i(y_i),x_i-y_i\rangle \ge \tfrac{\mu _{f_i}L_{f_i}}{\mu _{f_i}+L_{f_i}}\Vert x_i-y_i\Vert ^2\nonumber \\&\quad +\tfrac{1}{\mu _{f_i}+L_{f_i}}\Vert \nabla f_i(x_i)-\nabla f_i(y_i)\Vert ^2. \end{aligned}$$(A.3)
For the forward operator we have
$$\begin{aligned}&\Vert (\mathrm{id}-\tfrac{\gamma _i}{N}\nabla f_i)(x_i) - (\mathrm{id}-\tfrac{\gamma _i}{N}\nabla f_i)(y_i) \Vert ^2\\&= \Vert x_i-y_i\Vert ^2 + \tfrac{\gamma _i^2}{N^2} \Vert \nabla f_i(x_i)-\nabla f_i(y_i)\Vert ^2 - \tfrac{2\gamma _i}{N} \langle {}x_i-y_i{},{}\nabla f_i(x_i)-\nabla f_i(y_i){}\rangle \\&{\mathop {\le }\limits ^{(A.3)}} \Bigl ( 1-\tfrac{\gamma _i^2\mu _{f_i}L_{f_i}}{N^2} \Bigr ) \Vert x_i-y_i\Vert ^2 - \tfrac{\gamma _i}{N} \Bigl ( 2-\tfrac{\gamma _i}{N}(\mu _{f_i}+L_{f_i}) \Bigr ) \langle {}\nabla f_i(x_i)-\nabla f_i(y_i){},{}x_i-y_i{}\rangle \\&\le \left( 1-\tfrac{\gamma _i^2\mu _{f_i}L_{f_i}}{N^2}\right) \Vert x_i-y_i\Vert ^2 - \tfrac{\gamma _i\mu _{f_i}}{N} \left( 2-\tfrac{\gamma _i}{N}(\mu _{f_i}+L_{f_i})\right) \Vert x_i-y_i\Vert ^2\\&= \left( 1-\tfrac{\gamma _i\mu _{f_i}}{N}\right) ^2 \Vert x_i-y_i\Vert ^2, \end{aligned}$$where strong convexity and the fact that \(\gamma _i<\nicefrac {N}{L_{f_i}}\le \nicefrac {2N}{(\mu _{f_i}+L_{f_i})}\) were used in the second inequality. Multiplying by \(\gamma _i^{-1}\) and summing over i shows that \(\mathrm{id}-\varGamma \nabla F\) is \((1-\delta )\)-contractive in the metric \(\Vert \cdot \Vert _{\varGamma ^{-1}}\), and so is \({\text {T}}_\varGamma ^{\textsc {fb}}={{\,\mathrm{prox}\,}}_G^{\varGamma ^{-1}}\circ {}(\hbox {id} - \varGamma \nabla F)\) as it follows from assertion A.2(i).
-
A.2(vi) By strong convexity, denoting \(\varPhi _\star {:}{=}\min \varPhi \), we have
$$\begin{aligned} \varPhi _\star \le \varPhi ({\varvec{z}})-\tfrac{1}{2}\Vert {\varvec{z}}-{\varvec{x}}^\star \Vert _{\mu _F}^2 \le \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}}) - \tfrac{1}{2}\Vert {\varvec{z}}-{\varvec{x}}^\star \Vert _{\mu _F}^2 \end{aligned}$$where the second inequality follows from Lemma 2.3(ii). This establishes the lower bound. Since \({\varvec{z}}\) is a minimizer in (2.2a), the necessary stationarity condition reads \( \varGamma ^{-1}({\varvec{x}}-{\varvec{z}})-\nabla F({\varvec{x}}) \in \partial G({\varvec{z}})\). Convexity of \(G\) then implies
$$\begin{aligned} G({\varvec{x}}^\star ) \ge G({\varvec{z}}) + \langle {}\varGamma ^{-1}({\varvec{x}}-{\varvec{z}})-\nabla F({\varvec{x}}){},{}{\varvec{x}}^\star -{\varvec{z}}{}\rangle , \end{aligned}$$whereas from strong convexity of \(F\) we have
$$\begin{aligned} F({\varvec{x}}^\star ) \ge F({\varvec{x}}) + \langle {}\nabla F({\varvec{x}}){},{}{\varvec{x}}^\star -{\varvec{x}}{}\rangle + \tfrac{1}{2}\Vert {\varvec{x}}-{\varvec{x}}^\star \Vert ^2_{\mu _F}. \end{aligned}$$By combining these inequalities and (2.2b), we have
$$\begin{aligned} \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}})-\varPhi _\star&\le \tfrac{1}{2}\Vert {\varvec{z}}-{\varvec{x}}\Vert ^2_{\varGamma ^{-1}} - \tfrac{1}{2}\Vert {\varvec{x}}^\star -{\varvec{x}}\Vert ^2_{\mu _F} + \langle {}\varGamma ^{-1}({\varvec{z}}-{\varvec{x}}){},{}{\varvec{x}}^\star -{\varvec{z}}{}\rangle \\&= \tfrac{1}{2}\Vert {\varvec{z}}-{\varvec{x}}\Vert _{\varGamma ^{-1}-\mu _F}^2 + \langle {}(\varGamma ^{-1}-\mu _F)({\varvec{z}}-{\varvec{x}}){},{}{\varvec{x}}^\star -{\varvec{z}}{}\rangle - \tfrac{1}{2}\Vert {\varvec{x}}^\star -{\varvec{z}}\Vert _{\mu _F}^2. \end{aligned}$$Next, by using the inequality \( \langle {}{\varvec{a}}{},{}{\varvec{b}}{}\rangle \le \tfrac{1}{2}\Vert {\varvec{a}}\Vert _{\mu _F}^2 + \tfrac{1}{2}\Vert {\varvec{b}}\Vert ^2_{\mu _F^{-1}} \) to cancel out the last term, we obtain
$$\begin{aligned} \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}})-\varPhi _\star&\le \tfrac{1}{2}\Vert {\varvec{z}}-{\varvec{x}}\Vert _{\varGamma ^{-1}-\mu _F}^2 + \tfrac{1}{2}\Vert (\varGamma ^{-1}-\mu _F)({\varvec{x}}-{\varvec{z}})\Vert _{\mu _F^{-1}}^2\\&= \tfrac{1}{2}\Vert {\varvec{z}}-{\varvec{x}}\Vert _{\varGamma ^{-2}\mu _F^{-1}(\mathrm{I}-\varGamma \mu _F)}^2, \end{aligned}$$where the last identity uses the fact that the matrices are diagonal. \(\square \)
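The contraction claim in A.2(v) can also be sanity-checked numerically. The sketch below (illustrative data and helper names; it again assumes strongly convex quadratic \(f_i\) and \(G=\lambda \Vert \cdot \Vert _1\)) verifies \(\Vert {\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{x}})-{\text {T}}_\varGamma ^{\textsc {fb}}({\varvec{y}})\Vert _{\varGamma ^{-1}}\le (1-\delta )\Vert {\varvec{x}}-{\varvec{y}}\Vert _{\varGamma ^{-1}}\) with \(\delta =\frac{1}{N}\min _i\gamma _i\mu _{f_i}\).

```python
import numpy as np

# Numerical check of the contraction factor in Lemma A.2(v) on a toy strongly
# convex instance (illustrative data; assumes F(x) = (1/N)*sum_i f_i(x_i) with
# quadratic f_i and G = lam*||.||_1, so that prox_G is soft-thresholding).
rng = np.random.default_rng(1)
N, n, lam = 4, 3, 0.05
Q = [np.diag(rng.uniform(1.0, 5.0, n)) for _ in range(N)]    # mu_i*I <= Q_i <= L_i*I
mu = [np.min(np.diag(Qi)) for Qi in Q]
L  = [np.max(np.diag(Qi)) for Qi in Q]
gamma = [0.9 * N / Li for Li in L]                           # gamma_i < N/L_{f_i}
delta = min(gi * mi for gi, mi in zip(gamma, mu)) / N

def T_fb(x):                                   # forward-backward step, block-wise
    out = []
    for xi, Qi, gi in zip(x, Q, gamma):
        v = xi - (gi / N) * (Qi @ xi)          # forward step on f_i/N
        out.append(np.sign(v) * np.maximum(np.abs(v) - lam * gi, 0.0))
    return out

def dist(x, y):                                # norm induced by Gamma^{-1}
    return np.sqrt(sum(np.dot(xi - yi, xi - yi) / gi
                       for xi, yi, gi in zip(x, y, gamma)))

x = [rng.standard_normal(n) for _ in range(N)]
y = [rng.standard_normal(n) for _ in range(N)]
print(dist(T_fb(x), T_fb(y)) <= (1 - delta) * dist(x, y))    # True
```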
The next result recaps an important property that the FBE inherits from the cost function \(\varPhi \), instrumental for establishing global convergence and asymptotic linear rates for the BC Algorithm 1. The result is a special case of [64, Th. 5.2], after observing that
$$\begin{aligned} \varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}}) = \min _{{\varvec{w}}\in \mathbb {R}^{\sum _in_i}} \bigl \{ \varPhi ({\varvec{w}}) + D_H({\varvec{w}},{\varvec{x}}) \bigr \}, \end{aligned}$$
where \( D_H({\varvec{w}},{\varvec{x}}) = H({\varvec{w}})-H({\varvec{x}})-\langle {}\nabla H({\varvec{x}}){},{}{\varvec{w}}-{\varvec{x}}{}\rangle \) is the Bregman distance with kernel \(H=\tfrac{1}{2}\Vert \cdot \Vert _{\varGamma ^{-1}}^2-F\).
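To verify this identity, one may expand the Bregman distance with the kernel \(H\) above; the short computation below uses the expression of \({\mathcal {M}}_{\varGamma }\) recalled at the beginning of this appendix:
$$\begin{aligned} \varPhi ({\varvec{w}}) + D_H({\varvec{w}},{\varvec{x}})&= F({\varvec{w}})+G({\varvec{w}}) + \tfrac{1}{2}\Vert {\varvec{w}}-{\varvec{x}}\Vert _{\varGamma ^{-1}}^2 - \bigl ( F({\varvec{w}})-F({\varvec{x}})-\langle {}\nabla F({\varvec{x}}){},{}{\varvec{w}}-{\varvec{x}}{}\rangle \bigr ) \\&= {\mathcal {M}}_{\varGamma }({\varvec{w}},{\varvec{x}}), \end{aligned}$$
so that taking the minimum over \({\varvec{w}}\) recovers \(\varPhi _\varGamma ^{\textsc {fb}}({\varvec{x}})\).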
Lemma A.3
([64, Th. 5.2]) Suppose that Assumption I holds and for \(\gamma _i\in (0,\nicefrac {N}{L_{f_i}})\), \(i\in [N]\), let \(\varGamma ={{\,\mathrm{blkdiag}\,}}(\gamma _1\mathrm{I}_{n_1},\dots ,\gamma _N\mathrm{I}_{n_N})\). If \(\varPhi \) has the KL property with exponent \(\theta \in (0,1)\) (as is the case when \(f_i\) and \(G\) are semialgebraic), then so does \(\varPhi _\varGamma ^{\textsc {fb}}\) with exponent \( \max \left\{ \nicefrac 12,\theta \right\} \).
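As a reminder of the property invoked here (stated in one common formulation; the definition used in the main body may differ only in inessential constants), \(\varPhi \) is said to have the KL property with exponent \(\theta \in (0,1)\) at \(\bar{{\varvec{x}}}\in {{\,\mathrm{dom}\,}}\partial \varPhi \) if there exist \(c,\varepsilon >0\) such that
$$\begin{aligned} {{\,\mathrm{dist}\,}}\bigl (0,\partial \varPhi ({\varvec{x}})\bigr ) \ge c\,\bigl (\varPhi ({\varvec{x}})-\varPhi (\bar{{\varvec{x}}})\bigr )^{\theta } \end{aligned}$$
for all \({\varvec{x}}\) with \(\Vert {\varvec{x}}-\bar{{\varvec{x}}}\Vert \le \varepsilon \) and \(\varPhi (\bar{{\varvec{x}}})<\varPhi ({\varvec{x}})<\varPhi (\bar{{\varvec{x}}})+\varepsilon \). Lemma A.3 states that this inequality transfers from \(\varPhi \) to \(\varPhi _\varGamma ^{\textsc {fb}}\) with the exponent degraded at most to \(\max \{\nicefrac 12,\theta \}\).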