Convergence Rate Analysis of Several Splitting Schemes

Part of the Scientific Computation book series (SCIENTCOMP)

Abstract

Operator-splitting schemes are iterative algorithms for solving many types of numerical problems. A lot is known about these methods: they converge, and in many cases we know how quickly they converge. But when they are applied to optimization problems, there is a gap in our understanding: The theoretical speed of operator-splitting schemes is nearly always measured in the ergodic sense, but ergodic operator-splitting schemes are rarely used in practice. In this chapter, we tackle the discrepancy between theory and practice and uncover fundamental limits of a class of operator-splitting schemes. Our surprising conclusion is that the relaxed Peaceman-Rachford splitting algorithm, a version of the Alternating Direction Method of Multipliers (ADMM), is nearly as fast as the proximal point algorithm in the ergodic sense and nearly as slow as the subgradient method in the nonergodic sense. A large class of operator-splitting schemes extend from the relaxed Peaceman-Rachford splitting algorithm. Our results show that this class of operator-splitting schemes is also nearly as slow as the subgradient method. The tools we create in this chapter can also be used to prove nonergodic convergence rates of more general splitting schemes, so they are interesting in their own right.

This work is supported in part by NSF grants DMS-1317602 and ECCS-1462398.


Notes

  1. E.g., with gigabytes to terabytes of data.

  2. By ergodic, we mean the final approximate solution returned by the algorithm is an average over the history of all approximate solutions formed throughout the algorithm.

  3. After the initial release of this chapter, we used these tools to study several more general algorithms [25, 26, 27].

  4. This list is not exhaustive. See the comments after Theorem 8 for more details.

  5. \(x^{{\ast}}\in \mathcal{H}\) is a minimizer of (4.0).

  6. A recent result reported in Chapter 5 [55] of this volume shows the direct (non-duality) equivalence between ADMM and DRS when they are both applied to Problem (4.2).

References

  1. Bauschke, H.H., Bello Cruz, J.Y., Nghia, T.T.A., Phan, H.M., Wang, X.: The rate of linear convergence of the Douglas-Rachford algorithm for subspaces is the cosine of the Friedrichs angle. Journal of Approximation Theory 185, 63–79 (2014)
  2. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer (2011)
  3. Bauschke, H.H., Deutsch, F., Hundal, H.: Characterizing arbitrarily slow convergence in the method of alternating projections. International Transactions in Operational Research 16(4), 413–425 (2009)
  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
  5. Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, pp. 85–120 (2011)
  6. Boţ, R.I., Csetnek, E.R.: On the convergence rate of a forward-backward type primal-dual splitting algorithm for convex optimization problems. Optimization 64(1), 5–23 (2015)
  7. Boţ, R.I., Hendrich, C.: A Douglas-Rachford type primal-dual method for solving inclusions with mixtures of composite and parallel-sum type monotone operators. SIAM Journal on Optimization 23(4), 2541–2565 (2013)
  8. Boţ, R.I., Hendrich, C.: Solving monotone inclusions involving parallel sums of linearly composed maximally monotone operators. arXiv:1306.3191 [math] (2013)
  9. Boţ, R.I., Hendrich, C.: Convergence analysis for a primal-dual monotone + skew splitting algorithm with applications to total variation minimization. Journal of Mathematical Imaging and Vision, pp. 1–18 (2014)
  10. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1), 1–122 (2011)
  11. Bredies, K.: A forward-backward splitting algorithm for the minimization of non-smooth convex functionals in Banach space. Inverse Problems 25(1), 015005 (2009)
  12. Brézis, H., Lions, P.L.: Produits infinis de résolvantes. Israel Journal of Mathematics 29(4), 329–345 (1978)
  13. Briceño-Arias, L.M.: Forward-Douglas-Rachford splitting and forward-partial inverse method for solving monotone inclusions. Optimization 64(5), 1239–1261 (2015)
  14. Briceño-Arias, L.M., Combettes, P.L.: A monotone + skew splitting model for composite monotone inclusions in duality. SIAM Journal on Optimization 21(4), 1230–1250 (2011)
  15. Browder, F.E., Petryshyn, W.V.: The solution by iteration of nonlinear functional equations in Banach spaces. Bulletin of the American Mathematical Society 72(3), 571–575 (1966)
  16. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40(1), 120–145 (2011)
  17. Combettes, P.L.: Quasi-Fejérian analysis of some optimization algorithms. Studies in Computational Mathematics 8, 115–152 (2001)
  18. Combettes, P.L.: Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization 53(5–6), 475–504 (2004)
  19. Combettes, P.L.: Systems of structured monotone inclusions: duality, algorithms, and applications. SIAM Journal on Optimization 23(4), 2420–2447 (2013)
  20. Combettes, P.L., Condat, L., Pesquet, J.C., Vu, B.C.: A forward-backward view of some primal-dual optimization methods in image recovery. In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 4141–4145 (2014)
  21. Combettes, P.L., Pesquet, J.C.: Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators. Set-Valued and Variational Analysis 20(2), 307–330 (2012)
  22. Cominetti, R., Soto, J.A., Vaisman, J.: On the rate of convergence of Krasnosel'skiĭ-Mann iterations and their connection with sums of Bernoullis. Israel Journal of Mathematics, pp. 1–16 (2014)
  23. Condat, L.: A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications 158(2), 460–479 (2013)
  24. Corman, E., Yuan, X.: A generalized proximal point algorithm and its convergence rate. SIAM Journal on Optimization 24(4), 1614–1638 (2014)
  25. Davis, D.: Convergence rate analysis of primal-dual splitting schemes. SIAM Journal on Optimization 25(3), 1912–1943 (2015)
  26. Davis, D.: Convergence rate analysis of the forward-Douglas-Rachford splitting scheme. SIAM Journal on Optimization 25(3), 1760–1786 (2015)
  27. Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications. arXiv preprint arXiv:1504.01032v1 (2015)
  28. Deng, W., Lai, M.J., Yin, W.: On the o(1/k) convergence and parallelization of the alternating direction method of multipliers. arXiv preprint arXiv:1312.3040 (2013)
  29. Eckstein, J., Bertsekas, D.P.: On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming 55(1–3), 293–318 (1992)
  30. Franchetti, C., Light, W.: On the von Neumann alternating algorithm in Hilbert space. Journal of Mathematical Analysis and Applications 114(2), 305–314 (1986)
  31. Gabay, D.: Application of the methods of multipliers to variational inequalities. In: M. Fortin, R. Glowinski (eds.) Augmented Lagrangians: Application to the Numerical Solution of Boundary Value Problems, pp. 299–331. North-Holland, Amsterdam (1983)
  32. Glowinski, R., Marrocco, A.: Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet nonlinéaires. Rev. Française d'Aut. Inf. Rech. Opér. R-2, 41–76 (1975)
  33. Güler, O.: On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization 29(2), 403–419 (1991)
  34. He, B., Yuan, X.: On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis 50(2), 700–709 (2012)
  35. He, B., Yuan, X.: On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers. Numerische Mathematik 130(3), 567–577 (2015)
  36. Knopp, K.: Infinite Sequences and Series. Courier Dover Publications (1956)
  37. Krasnosel'skiĭ, M.A.: Two remarks on the method of successive approximations. Uspekhi Matematicheskikh Nauk 10(1), 123–127 (1955)
  38. Liang, J., Fadili, J., Peyré, G.: Convergence rates with inexact nonexpansive operators. Mathematical Programming 159, 403 (2016). arXiv:1404.4837
  39. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis 16(6), 964–979 (1979)
  40. Mann, W.R.: Mean value methods in iteration. Proceedings of the American Mathematical Society 4(3), 506–510 (1953)
  41. Monteiro, R.D.C., Svaiter, B.F.: Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization 23(1), 475–507 (2013)
  42. Monteiro, R.D.C., Svaiter, B.F.: On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM Journal on Optimization 20(6), 2755–2787 (2010)
  43. Monteiro, R.D.C., Svaiter, B.F.: Complexity of variants of Tseng's modified F-B splitting and Korpelevich's methods for hemivariational inequalities with applications to saddle-point and convex optimization problems. SIAM Journal on Optimization 21(4), 1688–1720 (2011)
  44. Nemirovsky, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
  45. Nesterov, Y.: Introductory Lectures on Convex Optimization, vol. 87. Springer US, Boston, MA (2004)
  46. Ogura, N., Yamada, I.: Non-strictly convex minimization over the fixed point set of an asymptotically shrinking nonexpansive mapping. Numerical Functional Analysis and Optimization 23(1–2), 113–137 (2002)
  47. Passty, G.B.: Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications 72(2), 383–390 (1979)
  48. Pock, T., Cremers, D., Bischof, H., Chambolle, A.: An algorithm for minimizing the Mumford-Shah functional. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1133–1140. IEEE (2009)
  49. Schizas, I.D., Ribeiro, A., Giannakis, G.B.: Consensus in ad hoc WSNs with noisy links. Part I: Distributed estimation of deterministic signals. IEEE Transactions on Signal Processing 56(1), 350–364 (2008)
  50. Shefi, R., Teboulle, M.: Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization. SIAM Journal on Optimization 24(1), 269–297 (2014)
  51. Shi, W., Ling, Q., Yuan, K., Wu, G., Yin, W.: On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing 62(7), 1750–1761 (2014)
  52. Tseng, P.: A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization 38(2), 431–446 (2000)
  53. Vũ, B.C.: A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Computational Mathematics 38(3), 667–681 (2013)
  54. Wei, E., Ozdaglar, A.: Distributed alternating direction method of multipliers. In: 2012 IEEE 51st Annual Conference on Decision and Control (CDC), pp. 5445–5450. IEEE (2012)
  55. Yan, M., Yin, W.: Self equivalence of the alternating direction method of multipliers. In: R. Glowinski, S. Osher, W. Yin (eds.) Splitting Methods in Communication, Imaging, Science, and Engineering. Springer (2016)


Appendices

A Further Applications of the Results of Section 3

1.1 o(1/(k + 1)^2) FPR of FBS and PPA

In problem (4.0), let g be a \(C^{1}\) function with Lipschitz derivative. The forward-backward splitting (FBS) algorithm is the iteration:

$$\displaystyle\begin{array}{rcl} z^{k+1} = \mathbf{prox}_{\gamma f}(z^{k} -\gamma \nabla g(z^{k})),\quad k = 0,1,\ldots.& &{}\end{array}$$
(4.27)

The FBS algorithm generalizes several classical methods (see below) and has the following subgradient representation:

$$\displaystyle\begin{array}{rcl} z^{k+1} = z^{k} -\gamma \widetilde{\nabla }f(z^{k+1}) -\gamma \nabla g(z^{k})& &{}\end{array}$$
(4.28)

where \(\widetilde{\nabla }f(z^{k+1}):= (1/\gamma )(z^{k} - z^{k+1} -\gamma \nabla g(z^{k})) \in \partial f(z^{k+1})\), and \(z^{k+1}\) and \(\widetilde{\nabla }f(z^{k+1})\) are unique given \(z^{k}\) and γ > 0.
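
For illustration only, iteration (4.27) can be realized in a few lines once \(\mathbf{prox}_{\gamma f}\) and ∇g are available. The Python sketch below uses a made-up ℓ1-regularized least-squares instance (the data M, b and the weight τ are arbitrary choices, not from the text): f = τ‖⋅‖_1 has the soft-thresholding operator as its proximal map, and g = (1/2)‖Mx − b‖^2 has a (1/β)-Lipschitz gradient with β = 1/‖M‖^2.

    import numpy as np

    def fbs(prox_f, grad_g, z0, gamma, iters=500):
        """Forward-backward splitting: z^{k+1} = prox_{gamma*f}(z^k - gamma*grad_g(z^k))."""
        z = z0.copy()
        fpr = []
        for _ in range(iters):
            z_next = prox_f(z - gamma * grad_g(z), gamma)
            fpr.append(np.linalg.norm(z_next - z) ** 2)   # squared fixed-point residual
            z = z_next
        return z, fpr

    # Made-up instance: f = tau*||.||_1 (prox = soft thresholding), g = 0.5*||Mx - b||^2.
    rng = np.random.default_rng(0)
    M, b, tau = rng.standard_normal((40, 100)), rng.standard_normal(40), 0.1
    prox_f = lambda x, g: np.sign(x) * np.maximum(np.abs(x) - g * tau, 0.0)
    grad_g = lambda x: M.T @ (M @ x - b)
    beta = 1.0 / np.linalg.norm(M, 2) ** 2                   # grad g is (1/beta)-Lipschitz
    z, fpr = fbs(prox_f, grad_g, np.zeros(100), gamma=beta)  # Theorem 12 requires gamma < 2*beta
    print(fpr[-1])                                           # decays like o(1/(k+1)^2)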

In this section, we analyze the convergence rate of the FBS algorithm given in Equations (4.27) and (4.28). If g = 0, FBS reduces to the proximal point algorithm (PPA) and β = ∞. If f = 0, FBS reduces to gradient descent. The FBS algorithm can be written in the following operator form:

$$\displaystyle\begin{array}{rcl} T_{\mathrm{FBS}}:= \mathbf{prox}_{\gamma f} \circ (I -\gamma \nabla g).& & {}\\ \end{array}$$

Because \(\mathbf{prox}_{\gamma f}\) is (1∕2)-averaged and \(I -\gamma \nabla g\) is γ∕(2β)-averaged [46, Theorem 3(b)], it follows that \(T_{\mathrm{FBS}}\) is \(\alpha _{\mathrm{FBS}}\)-averaged for

$$\displaystyle\begin{array}{rcl} \alpha _{\mathrm{FBS}}:= \frac{2\beta } {4\beta -\gamma } \in (1/2,1)& & {}\\ \end{array}$$

whenever γ < 2β [2, Proposition 4.32]. Thus, we have \(T_{\mathrm{FBS}} = (1 -\alpha _{\mathrm{FBS}})I +\alpha _{\mathrm{FBS}}T\) for a certain nonexpansive operator T, and \(T_{\mathrm{FBS}}(z^{k}) - z^{k} =\alpha _{\mathrm{FBS}}(Tz^{k} - z^{k})\). In particular, for all γ < 2β the following sum is finite:

$$\displaystyle\begin{array}{rcl} \sum _{k=0}^{\infty }\|T_{\mathrm{FBS}}(z^{k}) - z^{k}\|^{2}\stackrel{(4.10)}{\leq }\frac{\alpha _{\mathrm{FBS}}\|z^{0} - z^{{\ast}}\|^{2}} {(1 -\alpha _{\mathrm{FBS}})}.& & {}\\ \end{array}$$

To analyze the FBS algorithm we need to derive a joint subgradient inequality for f + g. First, we recall the following sufficient descent property for Lipschitz differentiable functions.

Theorem 11 (Descent Theorem [2, Theorem 18.15(iii)]).

If g is differentiable and ∇g is (1∕β)-Lipschitz, then for all \(x,y \in \mathcal{H}\) we have the upper bound

$$\displaystyle\begin{array}{rcl} g(x) \leq g(y) + \langle x - y,\nabla g(y)\rangle + \frac{1} {2\beta }\|x - y\|^{2}.& & {}\\ \end{array}$$

Corollary 4 (Joint Descent Theorem).

If g is differentiable and ∇g is (1∕β)-Lipschitz, then for all points x,y ∈dom (f) and \(z \in \mathcal{H}\) , and subgradients \(\widetilde{\nabla }f(x) \in \partial f(x)\) , we have

$$\displaystyle\begin{array}{rcl} f(x) + g(x) \leq f(y) + g(y) + \langle x - y,\nabla g(z) +\widetilde{ \nabla }f(x)\rangle + \frac{1} {2\beta }\|z - x\|^{2}.& &{}\end{array}$$
(4.29)

Proof.

Inequality (4.29) follows from adding the upper bound

$$\displaystyle\begin{array}{rcl} g(x) - g(y) \leq g(z) - g(y) + \langle x - z,\nabla g(z)\rangle + \frac{1} {2\beta }\|z - x\|^{2} \leq \langle x - y,\nabla g(z)\rangle + \frac{1} {2\beta }\|z - x\|^{2}& & {}\\ \end{array}$$

with the subgradient inequality: \(f(x) \leq f(y) + \langle x - y,\widetilde{\nabla }f(x)\rangle\). □ 

We now improve the O(1/(k + 1)^2) FPR rate for PPA in [12, Théorème 9] by showing that the FPR rate of FBS is actually o(1/(k + 1)^2).

Theorem 12 (Objective and FPR Convergence of FBS).

Let \(z^{0} \in \mathrm{dom}(f) \cap \mathrm{dom}(g)\) and let \(x^{{\ast}}\) be a minimizer of f + g. Suppose that \((z^{j})_{j\geq 0}\) is generated by FBS (iteration (4.27)), where ∇g is (1∕β)-Lipschitz and γ < 2β. Then for all k ≥ 0,

$$\displaystyle\begin{array}{rcl} h(z^{k+1},z^{k+1}) \leq \frac{\|z^{0} - x^{{\ast}}\|^{2}} {k + 1} \times \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{1} {2\gamma } \quad &\mbox{if $\gamma \leq \beta$ }; \\ \left (\frac{1} {2\gamma } + \left (\frac{1} {2\beta } -\frac{1} {2\gamma }\right ) \frac{\alpha _{\mathrm{FBS}}} {(1-\alpha _{\mathrm{FBS}})}\right )\quad &\text{otherwise,} \end{array} \right.& & {}\\ \end{array}$$

and

$$\displaystyle{h(z^{k+1},z^{k+1}) = o(1/(k + 1)),}$$

where the objective-error function h is defined in (4.19). In addition, for all k ≥ 0, we have \(\|T_{\mathrm{FBS}}z^{k+1} - z^{k+1}\|^{2} = o(1/(k + 1)^{2})\) and

$$\displaystyle\begin{array}{rcl} \|T_{\mathrm{FBS}}z^{k+1} - z^{k+1}\|^{2}& \leq & \frac{\|z^{0} - x^{{\ast}}\|^{2}} {\big(\frac{1} {\gamma } -\frac{1} {2\beta }\big)(k + 1)^{2}} \times \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{1} {2\gamma } \quad &\mbox{if $\gamma \leq \beta$ }; \\ \left (\frac{1} {2\gamma } +\big (\frac{1} {2\beta } -\frac{1} {2\gamma }\big) \frac{\alpha _{\mathrm{FBS}}} {(1-\alpha _{\mathrm{FBS}})}\right )\quad &\text{otherwise}. \end{array} \right.{}\\ \end{array}$$

Proof.

Recall that \(z^{k} - z^{k+1} =\gamma \widetilde{ \nabla }f(z^{k+1}) +\gamma \nabla g(z^{k}) \in \gamma \partial f(z^{k+1}) +\gamma \nabla g(z^{k})\) for all k ≥ 0. Thus, the joint descent theorem shows that for all x ∈ dom(f), we have

$$\displaystyle\begin{array}{rcl} & & f(z^{k+1}) + g(z^{k+1}) - f(x) - g(x)\stackrel{(4.29)}{\leq }\frac{1} {\gamma } \langle z^{k+1} - x,z^{k} - z^{k+1}\rangle + \frac{1} {2\beta }\|z^{k} - z^{k+1}\|^{2} \\ & & \phantom{f(z^{k+1}) + g(z^{k+1})-} = \frac{1} {2\gamma }\left (\|z^{k}\! -\! x\|^{2}\! -\!\| z^{k+1}\! -\! x\|^{2}\right )\! +\! \left (\frac{1} {2\beta } -\frac{1} {2\gamma }\right )\|z^{k+1}\! -\! z^{k}\|^{2}. {}\end{array}$$
(4.30)

If we set \(x = x^{{\ast}}\) in Equation (4.30), we see that \((h(z^{j+1},z^{j+1}))_{j\geq 0}\) is positive, summable, and

$$\displaystyle\begin{array}{rcl} \sum _{i=0}^{\infty }h(z^{j+1},z^{j+1}) \leq \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{1} {2\gamma }\|z^{0} - x^{{\ast}}\|^{2} \quad &\mbox{if $\gamma \leq \beta$ }; \\ \left (\frac{1} {2\gamma } + \left (\frac{1} {2\beta } -\frac{1} {2\gamma }\right ) \frac{\alpha _{\mathrm{FBS}}} {(1-\alpha _{\mathrm{FBS}})}\right )\|z^{0} - x^{{\ast}}\|^{2}\quad &\text{otherwise}. \end{array} \right.& &{}\end{array}$$
(4.31)

In addition, if we set \(x = z^{k}\) in Equation (4.30), then we see that \((h(z^{j+1},z^{j+1}))_{j\geq 0}\) is decreasing:

$$\displaystyle\begin{array}{rcl} \left (\frac{1} {\gamma } -\frac{1} {2\beta }\right )\|z^{k+1} - z^{k}\|^{2} \leq h(z^{k},z^{k}) - h(z^{k+1},z^{k+1}).& & {}\\ \end{array}$$

Therefore, the rates for \(h(z^{k+1},z^{k+1})\) follow by Lemma 1 Part (a), with \(a_{k} = h(z^{k+1},z^{k+1})\) and \(\lambda _{k} \equiv 1\).

Now we prove the rates for \(\|T_{\mathrm{FBS}}z^{k+1} - z^{k+1}\|^{2}\). We apply Part 3 of Lemma 1 with \(a_{k} = \left (1/\gamma - 1/(2\beta )\right )\|z^{k+2} - z^{k+1}\|^{2}\), \(\lambda _{k} \equiv 1\), \(e_{k} = 0\), and \(b_{k} = h(z^{k+1}) - h(x^{{\ast}})\) for all k ≥ 0, to show that \(\sum _{i=0}^{\infty }(i + 1)a_{i}\) is less than the sum in Equation (4.31). Part 2 of Theorem 1 shows that \((a_{j})_{j\geq 0}\) is monotonically nonincreasing. Therefore, the convergence rate of \((a_{j})_{j\geq 0}\) follows from Part (b) of Lemma 1. □ 

When f = 0, the objective error upper bound in Theorem 12 is strictly better than the bound provided in [45, Corollary 2.1.2]. In FBS, the objective error rate is the same as the one derived in [4, Theorem 3.1], when γ ∈ (0, β], and is the same as the one given in [11] in the case that γ ∈ (0, 2β). The little-o FPR rate is new in all cases except for the special case of PPA (g ≡ 0) under the condition that the sequence (z j) j ≥ 0 strongly converges to a minimizer [33].

1.2 o(1/(k + 1)^2) FPR of One Dimensional DRS

Whenever the operator \((T_{\mathrm{PRS}})_{1/2}\) is applied in \(\mathbf{R}\), the convergence rate of the FPR improves to o(1/(k + 1)^2).

Theorem 13.

Suppose that \(\mathcal{H}\ = \mathbf{R}\) , and suppose that (z j ) j≥0 is generated by the DRS algorithm, i.e., Algorithm  1 with λ k ≡ 1∕2. Then for all k ≥ 0,

$$\displaystyle\begin{array}{rcl} \vert (T_{\mathrm{PRS}})_{1/2}z^{k+1} - z^{k+1}\vert ^{2} \leq \frac{\vert z^{0} - z^{{\ast}}\vert ^{2}} {2(k + 1)^{2}} \ \ \mathrm{and}\ \ \vert (T_{\mathrm{PRS}})_{1/2}z^{k+1} - z^{k+1}\vert ^{2} = o\left ( \frac{1} {(k + 1)^{2}}\right ).& & {}\\ \end{array}$$

Proof.

Note that (T PRS)1∕2 is (1∕2)-averaged, and, hence, it is the resolvent of some maximal monotone operator on R [2, Corollary 23.8]. Furthermore, every maximal monotone operator on R is the subdifferential operator of a closed, proper, and convex function [2, Corollary 22.19]. Therefore, DRS is equivalent to the proximal point algorithm applied to a certain convex function on R. Thus, the result follows by Theorem 12 applied to this function. □ 

B Further Lower Complexity Results

2.1 Ergodic Convergence of Feasibility Problems

Proposition 8.

The ergodic feasibility convergence rate in Equation  (4.20) is optimal up to a factor of two.

Proof.

Algorithm 1 with \(\lambda _{k} = 1\) for all k ≥ 0 (i.e., PRS) is applied to the functions \(f =\iota _{\{(x_{1},x_{2})\in \mathbf{R}^{2}\vert x_{1}=0\}}\) and \(g =\iota _{\{(x_{1},x_{2})\in \mathbf{R}^{2}\vert x_{2}=0\}}\) with the initial iterate \(z^{0} = (1,1) \in \mathbf{R}^{2}\). Because \(T_{\mathrm{PRS}} = -I_{\mathcal{H}\ }\), it is easy to see that the only fixed point of \(T_{\mathrm{PRS}}\) is \(z^{{\ast}} = (0,0)\). In addition, the following identities are satisfied:

$$\displaystyle\begin{array}{rcl} x_{g}^{k} = \left \{\begin{array}{@{}l@{\quad }l@{}} (1,0) \quad &\text{even}\ k; \\ (-1,0)\quad &\text{odd}\ k. \end{array} \right.\quad z^{k} = \left \{\begin{array}{@{}l@{\quad }l@{}} (1,1) \quad &\text{even}\ k; \\ (-1,-1)\quad &\text{odd}\ k. \end{array} \right.\quad x_{f}^{k} = \left \{\begin{array}{@{}l@{\quad }l@{}} (0,-1)\quad &\text{even}\ k; \\ (0,1) \quad &\text{odd}\ k. \end{array} \right.& & {}\\ \end{array}$$

Thus, the PRS algorithm oscillates around the solution \(x^{{\ast}} = (0,0)\). However, note that the averaged iterates satisfy:

$$\displaystyle\begin{array}{rcl} \overline{x}_{g}^{k} = \left \{\begin{array}{@{}l@{\quad }l@{}} ( \frac{1} {k+1},0)\quad &\text{even}\ k; \\ (0,0) \quad &\text{odd}\ k. \end{array} \right.\quad \mathrm{and}\quad \overline{x}_{f}^{k} = \left \{\begin{array}{@{}l@{\quad }l@{}} (0, \frac{-1} {k+1})\quad &\text{even}\ k; \\ (0,0) \quad &\text{odd}\ k. \end{array} \right.& & {}\\ \end{array}$$

It follows that \(\|\overline{x}_{g}^{k} -\overline{x}_{f}^{k}\| = (1/(k + 1))\|(1,-1)\| = (1/(k + 1))\|z^{0} - z^{{\ast}}\|\), \(\forall k \geq 0\). □ 
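
For illustration only, a few lines of Python reproduce this counterexample numerically: the non-averaged PRS iterates cycle through the four points listed above, while the ergodic gap \(\|\overline{x}_{g}^{k} -\overline{x}_{f}^{k}\|\) equals \(\|z^{0} - z^{{\ast}}\|/(k + 1)\) for even k.

    import numpy as np

    P_Cg = lambda z: np.array([z[0], 0.0])   # projection onto {x2 = 0}
    P_Cf = lambda z: np.array([0.0, z[1]])   # projection onto {x1 = 0}

    z = np.array([1.0, 1.0])                 # z^0
    sum_xg, sum_xf = np.zeros(2), np.zeros(2)
    for k in range(1000):
        xg = P_Cg(z)
        xf = P_Cf(2 * xg - z)
        z = z + 2.0 * (xf - xg)              # PRS: lambda_k = 1
        sum_xg, sum_xf = sum_xg + xg, sum_xf + xf
        if k in (0, 1, 998, 999):
            gap = np.linalg.norm(sum_xg / (k + 1) - sum_xf / (k + 1))
            print(k, xg, xf, gap * (k + 1))  # last column equals sqrt(2) = ||z^0 - z^*|| for even k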

2.2 Optimal Objective and FPR Rates with Lipschitz Derivative

The following examples show that the objective and FPR rates derived in Theorem 12 are essentially optimal. The setup of the following counterexample already appeared in [12, Remarque 6] but the objective function lower bounds were not shown.

Theorem 14 (Lower Complexity of PPA).

There exists a Hilbert space \(\mathcal{H}\) , and a closed, proper, and convex function f such that for all α > 1∕2, there exists \(z^{0} \in \mathcal{H}\) such that if (z j ) j≥0 is generated by PPA, then

$$\displaystyle\begin{array}{rcl} \|\mathbf{prox}_{\gamma f}(z^{k}) - z^{k}\|^{2}& \geq & \frac{\gamma ^{2}} {(1 + 2\alpha )e^{2\gamma }(k+\gamma )^{1+2\alpha }}, {}\\ f(z^{k+1}) - f(x^{{\ast}})& \geq & \frac{1} {4\alpha e^{2\gamma }(k + 1+\gamma )^{2\alpha }}. {}\\ \end{array}$$

Proof.

Let \(\mathcal{H}\ =\ell _{2}(\mathbf{R})\), and define a linear map \(A: \mathcal{H}\ \rightarrow \mathcal{H}\):

$$\displaystyle\begin{array}{rcl} A\left (z_{1},z_{2},\cdots \,,z_{n},\cdots \,\right )& =& \left (z_{1}, \frac{z_{2}} {2},\cdots \,, \frac{z_{n}} {n},\cdots \,\right ). {}\\ \end{array}$$

For all \(z \in \mathcal{H}\), define \(f(z) = (1/2)\langle Az,z\rangle\). Thus, we have the following proximal identities for f:

$$\displaystyle\begin{array}{rcl} \mathbf{prox}_{\gamma f}(z)& =& (I +\gamma A)^{-1}(z) = \left ( \frac{j} {j+\gamma }z_{j}\right )_{j\geq 1}\quad \mathrm{and}\quad (I -\mathbf{prox}_{\gamma f})(z) = \left ( \frac{\gamma } {j+\gamma }z_{j}\right )_{j\geq 1}. {}\\ \end{array}$$

Now let \(z^{0} = (1/(j+\gamma )^{\alpha })_{j\geq 1} \in \mathcal{H}\), and set \(T = \mathbf{prox}_{\gamma f}\). Then we get the following FPR lower bound:

$$\displaystyle\begin{array}{rcl} \|z^{k+1} - z^{k}\|^{2} =\| T^{k}(T - I)z^{0}\|^{2}& =& \sum _{ i=1}^{\infty }\left ( \frac{i} {i+\gamma }\right )^{2k} \frac{\gamma ^{2}} {(i+\gamma )^{2+2\alpha }} {}\\ & \geq & \sum _{i=k}^{\infty }\left ( \frac{i} {i+\gamma }\right )^{2k} \frac{\gamma ^{2}} {(i+\gamma )^{2+2\alpha }} {}\\ & \geq & \frac{\gamma ^{2}} {(1 + 2\alpha )e^{2\gamma }(k+\gamma )^{1+2\alpha }}. {}\\ \end{array}$$

Furthermore, the objective lower bound holds

$$\displaystyle\begin{array}{rcl} f(z^{k+1}) - f(x^{{\ast}}) = \frac{1} {2}\langle Az^{k+1},z^{k+1}\rangle & =& \frac{1} {2}\sum _{i=1}^{\infty }\frac{1} {i} \left ( \frac{i} {i+\gamma }\right )^{2(k+1)} \frac{1} {(i+\gamma )^{2\alpha }} {}\\ & \geq & \frac{1} {2}\sum _{i=k+1}^{\infty }\left ( \frac{i} {i+\gamma }\right )^{2(k+1)} \frac{1} {(i+\gamma )^{1+2\alpha }} {}\\ & \geq & \frac{1} {4\alpha e^{2\gamma }(k + 1+\gamma )^{2\alpha }}. {}\\ \end{array}$$

 □ 
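
The construction in the proof is easy to probe numerically because \(\mathbf{prox}_{\gamma f}\) acts coordinatewise; the Python sketch below truncates \(\ell _{2}(\mathbf{R})\) to finitely many coordinates (the values of n, γ, and α are arbitrary choices) and checks the FPR lower bound at the last iteration.

    import numpy as np

    n, gamma, alpha = 5000, 1.0, 0.6     # truncate ell_2(R) to n coordinates; alpha > 1/2
    j = np.arange(1, n + 1)
    prox = j / (j + gamma)               # prox_{gamma*f} multiplies coordinate j by j/(j+gamma)
    z = 1.0 / (j + gamma) ** alpha       # z^0

    for k in range(200):
        z_next = prox * z
        fpr = np.sum((z_next - z) ** 2)  # ||prox_{gamma*f}(z^k) - z^k||^2
        lower = gamma ** 2 / ((1 + 2 * alpha) * np.exp(2 * gamma) * (k + gamma) ** (1 + 2 * alpha))
        z = z_next

    # the squared FPR stays above the theorem's lower bound, so it cannot decay
    # faster than roughly 1/k^{1+2*alpha}
    print(fpr, lower, fpr >= lower)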

C ADMM Convergence Rate Proofs

Given an initial vector \(z^{0} \in \mathcal{G}\), Lemma 2 shows that at each iteration relaxed PRS performs the following computations:

$$\displaystyle\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} w_{d_{g}}^{k} = \mathbf{prox}_{\gamma d_{g}}(z^{k}); \quad \\ w_{d_{f}}^{k} = \mathbf{prox}_{\gamma d_{f}}(2w_{d_{g}}^{k} - z^{k}); \quad \\ z^{k+1} = z^{k} + 2\lambda _{k}(w_{d_{f}}^{k} - w_{d_{g}}^{k}).\quad \end{array} \right.& & {}\\ \end{array}$$

In order to apply the relaxed PRS algorithm, we need to compute the proximal operators of the dual functions d f and d g .

Lemma 9 (Proximity Operators on the Dual).

Let \(w,v \in \mathcal{G}\) . Then the update formulas \(w^{+} = \mathbf{prox}_{\gamma d_{f}}(w)\) and \(v^{+} = \mathbf{prox}_{\gamma d_{g}}(v)\) are equivalent to the following computations

$$\displaystyle\begin{array}{rcl} & & \left \{\begin{array}{@{}l@{\quad }l@{}} x^{+} =\mathop{ \mathrm{arg\,min}}\limits _{ x\in \mathcal{H}\ _{1}}f(x) -\langle w,Ax\rangle + \frac{\gamma } {2}\|Ax\|^{2};\quad \\ w^{+} = w -\gamma Ax^{+}; \quad \end{array} \right. \\ & & \left \{\begin{array}{@{}l@{\quad }l@{}} y^{+} =\mathop{ \mathrm{arg\,min}}\limits _{y\in \mathcal{H}\ _{2}}g(y) -\langle v,By - b\rangle + \frac{\gamma } {2}\|By - b\|^{2};\quad \\ v^{+} = v -\gamma (By^{+} - b); \quad \end{array} \right.{}\end{array}$$
(4.32)

respectively. In addition, the subgradient inclusions hold: \(A^{{\ast}}w^{+} \in \partial f(x^{+})\) and \(B^{{\ast}}v^{+} \in \partial g(y^{+})\). Finally, \(w^{+}\) and \(v^{+}\) are independent of the choice of \(x^{+}\) and \(y^{+}\), respectively, even if they are not the unique solutions of the minimization subproblems.
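
For illustration only, the following Python sketch carries out the two-step computation \(w^{+} = \mathbf{prox}_{\gamma d_{f}}(w)\) of Lemma 9 for the made-up quadratic f(x) = (1/2)‖x − c‖^2, for which the x-subproblem in (4.32) is a linear solve; the data A, c, and γ are arbitrary choices.

    import numpy as np

    def prox_dual_f(w, A, c, gamma):
        """w+ = prox_{gamma*d_f}(w) computed via (4.32) for f(x) = 0.5*||x - c||^2."""
        n = A.shape[1]
        # x-subproblem: minimize f(x) - <w, Ax> + (gamma/2)*||Ax||^2  (closed form here)
        x_plus = np.linalg.solve(np.eye(n) + gamma * A.T @ A, c + A.T @ w)
        w_plus = w - gamma * A @ x_plus        # dual update from Lemma 9
        return w_plus, x_plus

    rng = np.random.default_rng(1)
    A, c = rng.standard_normal((3, 5)), rng.standard_normal(5)
    w_plus, x_plus = prox_dual_f(rng.standard_normal(3), A, c, gamma=0.7)
    # subgradient inclusion A^* w^+ in df(x^+) = {x^+ - c}:
    print(np.allclose(A.T @ w_plus, x_plus - c))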

We can use Lemma 9 to derive the relaxed form of ADMM in Algorithm 2. Note that this form of ADMM eliminates the “hidden variable” sequence \((z^{j})_{j\geq 0}\) in Equation (4.32). The following derivation is not new, but it is included for the sake of completeness. See [31] for the original derivation.

Proposition 9 (Relaxed ADMM).

Let \(z^{0} \in \mathcal{G}\), and let \((z^{j})_{j\geq 0}\) be generated by the relaxed PRS algorithm applied to the dual formulation in Equation (4.27). Choose initial points \(w_{d_{g}}^{-1} = z^{0}\), \(x^{-1} = 0\), and \(y^{-1} = 0\), and initial relaxation \(\lambda _{-1} = 1/2\). Then we have the following identities starting from k = −1:

$$\displaystyle\begin{array}{rcl} y^{k+1}& =& \mathop{\mathrm{arg\,min}}\limits _{ y\in \mathcal{H}\ _{2}}g(y) -\langle w_{d_{g}}^{k},Ax^{k} + By - b\rangle + {}\\ & & \qquad \quad \frac{\gamma } {2}\|Ax^{k} + By - b + (2\lambda _{ k} - 1)(Ax^{k} + By^{k} - b)\|^{2} {}\\ w_{d_{g}}^{k+1}& =& w_{ d_{g}}^{k} -\gamma (Ax^{k} + By^{k+1} - b) -\gamma (2\lambda _{ k} - 1)(Ax^{k} + By^{k} - b) {}\\ x^{k+1}& =& \mathop{\mathrm{arg\,min}}\limits _{ x\in \mathcal{H}\ _{1}}f(x) -\langle w_{d_{g}}^{k+1},Ax + By^{k+1} - b\rangle + \frac{\gamma } {2}\|Ax + By^{k+1} - b\|^{2} {}\\ w_{d_{f}}^{k+1}& =& w_{ d_{g}}^{k+1} -\gamma (Ax^{k+1} + By^{k+1} - b) {}\\ \end{array}$$

Proof.

By Equation (4.32) and Lemma 9, we get the following formulation for the k-th iteration: Given \(z^{0} \in \mathcal{H}\)

$$\displaystyle\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} y^{k} \quad &=\mathop{ \mathrm{arg\,min}}\limits _{y\in \mathcal{H}\ _{2}}g(y) -\langle z^{k},By - b\rangle + \frac{\gamma } {2}\|By - b\|^{2}; \\ w_{d_{g}}^{k}\quad &= z^{k} -\gamma (By^{k} - b); \\ x^{k} \quad &=\mathop{ \mathrm{arg\,min}}\limits _{x\in \mathcal{H}\ _{1}}f(x) -\langle 2w_{d_{g}}^{k} - z^{k},Ax\rangle + \frac{\gamma } {2}\|Ax\|^{2}; \\ w_{d_{f}}^{k}\quad &= 2w_{d_{g}}^{k} - z^{k} -\gamma Ax^{k}; \\ z^{k+1} \quad &= z^{k} + 2\lambda _{k}(w_{d_{f}}^{k} - w_{d_{g}}^{k}).\end{array} \right.& & {}\\ \end{array}$$

We will use this form to get to the claimed iteration. First,

$$\displaystyle\begin{array}{rcl} 2w_{d_{g}}^{k} - z^{k} = w_{ d_{g}}^{k} -\gamma (By^{k} - b)\quad \mathrm{and}\quad w_{ d_{f}}^{k} = w_{ d_{g}}^{k} -\gamma (Ax^{k} + By^{k} - b).& &{}\end{array}$$
(4.33)

Furthermore, we can simplify the definition of x k:

$$\displaystyle\begin{array}{rcl} x^{k}& = & \mathop{\mathrm{arg\,min}}\limits _{ x\in \mathcal{H}\ _{1}}f(x) -\langle 2w_{d_{g}}^{k} - z^{k},Ax\rangle + \frac{\gamma } {2}\|Ax\|^{2} {}\\ & \stackrel{\mathrm{(<InternalRef RefID="Equ136">4.33</InternalRef>)}}{=}& \mathop{\mathrm{arg\,min}}\limits _{x\in \mathcal{H}\ _{1}}f(x) -\langle w_{d_{g}}^{k} -\gamma (By^{k} - b),Ax\rangle + \frac{\gamma } {2}\|Ax\|^{2} {}\\ & = & \mathop{\mathrm{arg\,min}}\limits _{x\in \mathcal{H}\ _{1}}f(x) -\langle w_{d_{g}}^{k},Ax + By^{k} - b\rangle + \frac{\gamma } {2}\|Ax + By^{k} - b\|^{2}. {}\\ \end{array}$$

Note that the last two lines of Equation (4.34) differ by terms independent of x.

We now eliminate the \(z^{k}\) variable from the \(y^{k+1}\) subproblem: because \(w_{d_{f}}^{k} + z^{k} = 2w_{d_{g}}^{k} -\gamma Ax^{k}\), we have

$$\displaystyle\begin{array}{rcl} z^{k+1}& = & z^{k} + 2\lambda _{ k}(w_{d_{f}}^{k} - w_{ d_{g}}^{k}) {}\\ & \stackrel{\mathrm{(<InternalRef RefID="Equ136">4.33</InternalRef>)}}{=}& z^{k} + w_{ d_{f}}^{k} - w_{ d_{g}}^{k} +\gamma (2\lambda _{ k} - 1)(Ax^{k} + By^{k} - b) {}\\ & = & w_{d_{g}}^{k} -\gamma Ax^{k} -\gamma (2\lambda _{ k} - 1)(Ax^{k} + By^{k} - b). {}\\ \end{array}$$

We can simplify the definition of y k+1 by applying the identity in Equation (4.34):

$$\displaystyle\begin{array}{rcl} & & y^{k+1} =\mathop{ \mathrm{arg\,min}}\limits _{ y\in \mathcal{H}\ _{2}}g(y) -\langle z^{k+1},By - b\rangle + \frac{\gamma } {2}\|By - b\|^{2} {}\\ & & \stackrel{\mathrm{(C)}}{=}\mathop{\mathrm{arg\,min}}\limits _{y\in \mathcal{H}\ _{2}}g(y) -\langle w_{d_{g}}^{k} -\gamma Ax^{k} -\gamma (2\lambda _{ k} - 1)(Ax^{k} + By^{k} - b),By - b\rangle + \frac{\gamma } {2}\|By - b\|^{2} {}\\ & & =\mathop{ \mathrm{arg\,min}}\limits _{y\in \mathcal{H}\ _{2}}g(y) -\langle w_{d_{g}}^{k},Ax^{k} + By - b\rangle + \frac{\gamma } {2}\|Ax^{k} + By - b + (2\lambda _{ k} - 1)(Ax^{k} + By^{k} - b)\|^{2}.{}\\ \end{array}$$

The result then follows from Equations (4.33), (4.33), (4.34), and (4.34), combined with the initial conditions listed in the statement of the proposition. In particular, note that the updates of \(x,y,w_{d_{f}},\) and \(w_{d_{g}}\) do not explicitly depend on z.  □ 

Remark 2.

Proposition 9 proves that \(w_{d_{f}}^{k+1} = w_{d_{g}}^{k+1} -\gamma (Ax^{k+1} + By^{k+1} - b)\). Recall that by Equation (4.32), \(z^{k+1} - z^{k} = 2\lambda _{k}(w_{d_{f}}^{k} - w_{d_{g}}^{k})\). Therefore, it follows that

$$\displaystyle\begin{array}{rcl} z^{k+1} - z^{k}& =& -2\gamma \lambda _{ k}(Ax^{k} + By^{k} - b).{}\end{array}$$
(4.34)
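
As a sanity check on Proposition 9, the Python sketch below runs the relaxed ADMM updates on a made-up instance in which both subproblems have closed-form solutions: f(x) = (1/2)‖x‖^2, g(y) = (1/2)‖y‖^2, A = B = I, and a fixed vector b, so each arg min reduces to a linear solve and the unique primal solution is x = y = b/2. The instance and all names below are our illustrative choices.

    import numpy as np

    n, gamma, lam = 4, 1.0, 0.5                 # lam = lambda_k (ADMM/DRS corresponds to 1/2)
    rng = np.random.default_rng(0)
    b = rng.standard_normal(n)

    x, y = np.zeros(n), np.zeros(n)             # x^{-1}, y^{-1}
    w_g = rng.standard_normal(n)                # w_{d_g}^{-1} = z^0

    for _ in range(200):
        r = x + y - b                           # Ax^k + By^k - b
        # y-update and w_{d_g}-update of Proposition 9, specialized to this quadratic instance
        y = (w_g - gamma * (x - b) - gamma * (2 * lam - 1) * r) / (1 + gamma)
        w_g = w_g - gamma * (x + y - b) - gamma * (2 * lam - 1) * r
        # x-update and w_{d_f}-update
        x = (w_g - gamma * (y - b)) / (1 + gamma)
        w_f = w_g - gamma * (x + y - b)

    print(np.allclose(x, b / 2), np.allclose(y, b / 2), np.linalg.norm(x + y - b))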

3.1 Dual Feasibility Convergence Rates

We can apply the results of Section 5 to deduce convergence rates for the dual objective functions. Instead of restating those theorems, we just list the following bounds on the feasibility of the primal iterates.

Theorem 15.

Suppose that (z j ) j≥0 is generated by Algorithm  2 , and let (λ j ) j≥0 ⊆ (0,1]. Then the following convergence rates hold:

  1.

    Ergodic convergence: The feasibility convergence rate holds:

    $$\displaystyle\begin{array}{rcl} \|A\overline{x}^{k} + B\overline{y}^{k} - b\|^{2} \leq \frac{4\|z^{0} - z^{{\ast}}\|^{2}} {\gamma ^{2}\varLambda _{k}^{2}}.& & {}\\ \end{array}$$
  2.

    Nonergodic convergence: Suppose that \(\underline{\tau } =\inf _{j\geq 0}\lambda _{j}(1 -\lambda _{j}) > 0\). Then

    $$\displaystyle\begin{array}{rcl} \|Ax^{k} + By^{k} - b\|^{2} \leq \frac{\|z^{0} - z^{{\ast}}\|^{2}} {4\gamma ^{2}\underline{\tau }(k + 1)}\quad \mathrm{and}\quad \|Ax^{k} + By^{k} - b\|^{2} = o\left ( \frac{1} {k + 1}\right ).& & {}\\ \end{array}$$

Proof.

Parts 1 and 2 are straightforward applications of Corollary 1 and the FPR identity: \(z^{k} - z^{k+1}\stackrel{\mbox{ (4.34)}}{=}2\gamma \lambda _{k}(Ax^{k} + By^{k} - b).\) □ 

3.2 Converting Dual Inequalities to Primal Inequalities

The ADMM algorithm generates the following five sequences of iterates:

$$\displaystyle\begin{array}{rcl} (z^{j})_{ j\geq 0},(w_{d_{f}}^{j})_{ j\geq 0},\ \text{and}\ (w_{d_{g}}^{j})_{ j\geq 0} \subseteq \mathcal{G}\quad \text{and}\quad (x^{j})_{ j\geq 0} \in \mathcal{H}\ _{1},(y^{j})_{ j\geq 0} \in \mathcal{H}\ _{2}.& & {}\\ \end{array}$$

The dual variables do not necessarily have a meaningful interpretation, so it is desirable to derive convergence rates involving the primal variables. In this section we will apply the Fenchel-Young inequality [2, Proposition 16.9] to convert the dual objective into a primal expression.

The following proposition will help us derive primal fundamental inequalities akin to Propositions 2 and 3.

Proposition 10.

Suppose that \((z^{j})_{j\geq 0}\) is generated by Algorithm 2. Let \(z^{{\ast}}\) be a fixed point of \(T_{\mathrm{PRS}}\) and let \(w^{{\ast}} = \mathbf{prox}_{\gamma d_{f}}(z^{{\ast}})\). Then the following identity holds:

$$\displaystyle\begin{array}{rcl} & & 4\gamma \lambda _{k}(h(x^{k},y^{k})) = -4\gamma \lambda _{ k}(d_{f}(w_{d_{f}}^{k}) + d_{ g}(w_{d_{g}}^{k}) - d_{ f}(w^{{\ast}}) - d_{ g}(w^{{\ast}})) \\ & & \phantom{4\gamma \lambda _{k}(h(x^{k},}+\left (2\left (1 - \frac{1} {2\lambda _{k}}\right )\|z^{k} - z^{k+1}\|^{2} + 2\langle z^{k} - z^{k+1},z^{k+1}\rangle \right ). {}\end{array}$$
(4.35)

Proof.

We have the following subgradient inclusions from Proposition 9: \(A^{{\ast}}w_{d_{f}}^{k} \in \partial f(x^{k})\) and \(B^{{\ast}}w_{d_{g}}^{k} \in \partial g(y^{k}).\) From the Fenchel-Young inequality [2, Proposition 16.9] we have the following expressions for \(d_{f}\) and \(d_{g}\):

$$\displaystyle\begin{array}{rcl} d_{f}(w_{d_{f}}^{k}) = \langle A^{{\ast}}w_{ d_{f}}^{k},x^{k}\rangle - f(x^{k})\quad \mathrm{and}\quad d_{ g}(w_{d_{g}}^{k}) = \langle B^{{\ast}}w_{ d_{g}}^{k},y^{k}\rangle - g(y^{k}) -\langle w_{ d_{g}}^{k},b\rangle.& & {}\\ \end{array}$$

Therefore,

$$\displaystyle\begin{array}{rcl} -d_{f}(w_{d_{f}}^{k}) - d_{ g}(w_{d_{g}}^{k}) = f(x^{k}) + g(y^{k}) -\langle Ax^{k} + By^{k} - b,w_{ d_{f}}^{k}\rangle -\langle w_{ d_{g}}^{k} - w_{ d_{f}}^{k},By^{k} - b\rangle.& & {}\\ \end{array}$$

Let us simplify this bound with an identity from Proposition 9: from \(w_{d_{f}}^{k} - w_{d_{g}}^{k} = -\gamma (Ax^{k} + By^{k} - b),\) it follows that

$$\displaystyle\begin{array}{rcl} - d_{f}(w_{d_{f}}^{k}) - d_{ g}(w_{d_{g}}^{k})& =& f(x^{k}) + g(y^{k}) + \frac{1} {\gamma } \langle w_{d_{f}}^{k} - w_{ d_{g}}^{k},w_{ d_{f}}^{k} +\gamma (By^{k} - b)\rangle.\qquad {}\end{array}$$
(4.36)

Recall that \(\gamma (By^{k} - b) = z^{k} - w_{d_{g}}^{k}\). Therefore

$$\displaystyle{w_{d_{f}}^{k}+\gamma (By^{k}-b) = z^{k}+(w_{ d_{f}}^{k}-w_{ d_{g}}^{k}) = z^{k}+ \frac{1} {2\lambda _{k}}(z^{k+1}-z^{k}) = \frac{1} {2\lambda _{k}}(2\lambda _{k}-1)(z^{k}-z^{k+1})+z^{k+1},}$$

and the inner product term can be simplified as follows:

$$\displaystyle\begin{array}{rcl} \frac{1} {\gamma } \langle w_{d_{f}}^{k} - w_{ d_{g}}^{k},w_{ d_{f}}^{k} +\gamma (By^{k} - b)\rangle & =& \frac{1} {\gamma } \langle \frac{1} {2\lambda _{k}}(z^{k+1} - z^{k}), \frac{1} {2\lambda _{k}}(2\lambda _{k} - 1)(z^{k} - z^{k+1})\rangle \\ & +& \frac{1} {\gamma } \langle \frac{1} {2\lambda _{k}}(z^{k+1} - z^{k}),z^{k+1}\rangle \\ & =& -\frac{1} {2\gamma \lambda _{k}}\left (1 - \frac{1} {2\lambda _{k}}\right )\|z^{k+1} - z^{k}\|^{2} \\ & -& \frac{1} {2\gamma \lambda _{k}}\langle z^{k} - z^{k+1},z^{k+1}\rangle. {}\end{array}$$
(4.37)

Now we derive an expression for the dual objective at a dual optimal \(w^{{\ast}}\). First, if \(z^{{\ast}}\) is a fixed point of \(T_{\mathrm{PRS}}\), then \(0 = T_{\mathrm{PRS}}(z^{{\ast}}) - z^{{\ast}} = 2(w_{d_{g}}^{{\ast}}- w_{d_{f}}^{{\ast}}) = -2\gamma (Ax^{{\ast}} + By^{{\ast}}- b)\). Thus, from Equation (4.36) with k replaced by ∗, we get

$$\displaystyle\begin{array}{rcl} - d_{f}(w^{{\ast}}) - d_{ g}(w^{{\ast}})& = f(x^{{\ast}}) + g(y^{{\ast}}) + \langle Ax^{{\ast}} + By^{{\ast}}- b,w^{{\ast}}\rangle = f(x^{{\ast}}) + g(y^{{\ast}}).\qquad \ &{}\end{array}$$
(4.38)

Therefore, Equation (4.35) follows by subtracting (4.38) from Equation (4.36), rearranging and using the identity in Equation (4.37). □ 

The following two propositions prove two fundamental inequalities that bound the primal objective.

Proposition 11 (ADMM Primal Upper Fundamental Inequality).

Let \(z^{{\ast}}\) be a fixed point of \(T_{\mathrm{PRS}}\) and let \(w^{{\ast}} = \mathbf{prox}_{\gamma d_{g}}(z^{{\ast}})\). Then for all k ≥ 0, we have the bound:

$$\displaystyle\begin{array}{rcl} 4\gamma \lambda _{k}h(x^{k},y^{k}) \leq \| z^{k} - (z^{{\ast}}- w^{{\ast}})\|^{2} -\| z^{k+1} - (z^{{\ast}}- w^{{\ast}})\|^{2} + \left (1 -\frac{1} {\lambda _{k}}\right )\|z^{k} - z^{k+1}\|^{2},& &{}\end{array}$$
(4.39)

where the objective-error function h is defined in  (4.19) .

Proof.

The lower inequality in Proposition 3 applied to d f + d g shows that

$$\displaystyle\begin{array}{rcl} -4\gamma \lambda _{k}(d_{f}(w_{d_{f}}^{k}) + d_{ g}(w_{d_{g}}^{k}) - d_{ f}(w^{{\ast}}) - d_{ g}(w^{{\ast}}))& \leq 2\langle z^{k+1} - z^{k},z^{{\ast}}- w^{{\ast}}\rangle.& {}\\ \end{array}$$

The proof then follows from Proposition 10, and the simplification:

$$\displaystyle\begin{array}{rcl} & & 2\langle z^{k} - z^{k+1},z^{k+1} - (z^{{\ast}}- w^{{\ast}})\rangle + 2\left (1 - \frac{1} {2\lambda _{k}}\right )\|z^{k} - z^{k+1}\|^{2} {}\\ & & =\| z^{k} - (z^{{\ast}}- w^{{\ast}})\|^{2} -\| z^{k+1} - (z^{{\ast}}- w^{{\ast}})\|^{2} + \left (1 -\frac{1} {\lambda _{k}}\right )\|z^{k} - z^{k+1}\|^{2}. {}\\ \end{array}$$

 □ 

Remark 3.

Note that Equation (4.39) is nearly identical to the upper inequality in Proposition 2, except that \(z^{{\ast}} - w^{{\ast}}\) appears in the former where \(x^{{\ast}}\) appears in the latter.

Proposition 12 (ADMM Primal Lower Fundamental Inequality).

Let \(z^{{\ast}}\) be a fixed point of \(T_{\mathrm{PRS}}\) and let \(w^{{\ast}} = \mathbf{prox}_{\gamma d_{g}}(z^{{\ast}})\). Then for all \(x \in \mathcal{H}\ _{1}\) and \(y \in \mathcal{H}\ _{2}\) we have the bound:

$$\displaystyle\begin{array}{rcl} h(x,y) \geq \langle Ax + By - b,w^{{\ast}}\rangle,& &{}\end{array}$$
(4.40)

where the objective-error function h is defined in  (4.19) .

Proof.

The lower bound follows from the subgradient inequalities:

$$\displaystyle\begin{array}{rcl} f(x) - f(x^{{\ast}}) \geq \langle x - x^{{\ast}},A^{{\ast}}w^{{\ast}}\rangle \quad \text{and}\quad g(y) - g(y^{{\ast}}) \geq \langle y - y^{{\ast}},B^{{\ast}}w^{{\ast}}\rangle.& & {}\\ \end{array}$$

We sum these inequalities and use \(Ax^{{\ast}} + By^{{\ast}} = b\) to get Equation (4.40).   □ 

Remark 4.

We use Inequality (4.40) in two special cases:

$$\displaystyle\begin{array}{rcl} h(x^{k},y^{k})& \geq & \frac{1} {\gamma } \langle w_{d_{g}}^{k} - w_{ d_{f}}^{k},w^{{\ast}}\rangle {}\\ h(\overline{x}^{k},\overline{y}^{k})& \geq & \frac{1} {\gamma } \langle \overline{w}_{d_{g}}^{k} -\overline{w}_{ d_{f}}^{k},w^{{\ast}}\rangle. {}\\ \end{array}$$

These bounds are nearly identical to the fundamental lower inequality in Proposition 3, except that \(w^{{\ast}}\) appears in the former where \(z^{{\ast}} - x^{{\ast}}\) appeared in the latter.

3.3 Converting Dual Convergence Rates to Primal Convergence Rates

We can use the inequalities deduced in Section 3.2 to derive convergence rates for the primal objective values. The structure of the proofs of Theorems 9 and 10 is exactly the same as in the primal convergence case in Section 5, except that we use the upper and lower inequalities derived in Section 3.2 instead of the fundamental upper and lower inequalities in Propositions 2 and 3. This amounts to replacing the terms \(z^{{\ast}} - x^{{\ast}}\) and \(x^{{\ast}}\) by \(w^{{\ast}}\) and \(z^{{\ast}} - w^{{\ast}}\), respectively, in all of the inequalities from Section 5. Thus, we omit the proofs.

D Examples

In this section, we apply relaxed PRS and relaxed ADMM to concrete problems and explicitly bound the associated objectives and FPR terms with the convergence rates we derived in the previous sections.

4.1 Feasibility Problems

Suppose that \(C_{f}\) and \(C_{g}\) are closed convex subsets of \(\mathcal{H}\) with nonempty intersection. The goal of the feasibility problem is to find a point in the intersection of \(C_{f}\) and \(C_{g}\). In this section, we present one way to model this problem using convex optimization and apply the relaxed PRS algorithm to the resulting minimization problem.

In general, we cannot expect linear convergence of the relaxed PRS algorithm for the feasibility problem. We showed this in Theorem 6 by constructing an example for which the DRS iteration converges in norm but does so arbitrarily slowly. A similar result holds for the alternating projection (AP) algorithm [3]. Thus, in this section we focus on the convergence rate of the FPR.

Let \(\iota _{C_{f}}\) and \(\iota _{C_{g}}\) be the indicator functions of \(C_{f}\) and \(C_{g}\). Then \(x \in C_{f} \cap C_{g}\) if, and only if, \(\iota _{C_{f}}(x) +\iota _{C_{g}}(x) = 0\), and the sum is infinite otherwise. Thus, a point is in the intersection of \(C_{f}\) and \(C_{g}\) if, and only if, it is a minimizer of the following problem:

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{x\in \mathcal{H}\ }\iota _{C_{f}}(x) +\iota _{C_{g}}(x).& & {}\\ \end{array}$$

The relaxed PRS algorithm applied to this problem, with \(f =\iota _{C_{f}}\) and \(g =\iota _{C_{g}}\), has the following form: Given \(z^{0} \in \mathcal{H}\), for all k ≥ 0, let

$$\displaystyle\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} x_{g}^{k} = P_{C_{g}}(z^{k}); \quad \\ x_{f}^{k} = P_{C_{f}}(2x_{g}^{k} - z^{k}); \quad \\ z^{k+1} = z^{k} + 2\lambda _{k}(x_{f}^{k} - x_{g}^{k}).\quad \end{array} \right.& & {}\\ \end{array}$$
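
For illustration only, the iteration above takes a few lines of Python once the projections are available; in the sketch below the sets are made up (\(C_{f}\) is an affine set {x : Mx = d} and \(C_{g}\) is the nonnegative orthant), and the intersection is nonempty by construction.

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 20))
    d = M @ np.abs(rng.standard_normal(20))              # ensures C_f and C_g intersect

    MMt_inv = np.linalg.inv(M @ M.T)
    P_Cf = lambda z: z - M.T @ (MMt_inv @ (M @ z - d))   # projection onto {x : Mx = d}
    P_Cg = lambda z: np.maximum(z, 0.0)                  # projection onto the nonnegative orthant

    z = rng.standard_normal(20)                          # z^0
    lam = 0.9                                            # lambda_k, bounded away from 0 and 1
    for k in range(2000):
        xg = P_Cg(z)
        xf = P_Cf(2 * xg - z)
        z = z + 2 * lam * (xf - xg)

    # x_f^k lies in C_f, x_g^k lies in C_g, and ||x_f^k - x_g^k||^2 = o(1/(k+1))
    print(np.linalg.norm(xf - xg))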

Because \(f =\iota _{C_{f}}\) and \(g =\iota _{C_{g}}\) only take on the values 0 and ∞, the objective value convergence rates derived earlier do not provide meaningful information, other than \(x_{f}^{k} \in C_{f}\) and \(x_{g}^{k} \in C_{g}\). However, from the FPR identity \(x_{f}^{k} - x_{g}^{k} = (1/(2\lambda _{k}))(z^{k+1} - z^{k})\), we find that after k iterations, Corollary 1 produces the bound

$$\displaystyle\begin{array}{rcl} \max \{d_{C_{g}}^{2}(x_{ f}^{k}),d_{ C_{f}}^{2}(x_{ g}^{k})\} \leq \| x_{ f}^{k} - x_{ g}^{k}\|^{2} = o\left ( \frac{1} {k + 1}\right )& & {}\\ \end{array}$$

whenever (λ j ) j ≥ 0 is bounded away from 0 and 1. Theorem 5 showed that this rate is optimal. Furthermore, if we average the iterates over all k, Theorem 3 gives the improved bound

$$\displaystyle\begin{array}{rcl} \max \{d_{C_{g}}^{2}(\overline{x}_{ f}^{k}),d_{ C_{f}}^{2}(\overline{x}_{ g}^{k})\} \leq \|\overline{x}_{ f}^{k} -\overline{x}_{ g}^{k}\|^{2} = O\left ( \frac{1} {\varLambda _{k}^{2}}\right ),& & {}\\ \end{array}$$

which is optimal by Proposition 8. Note that the averaged iterates satisfy \(\overline{x}_{f}^{k} = (1/\varLambda _{k})\sum _{i=0}^{k}\lambda _{i}x_{f}^{i} \in C_{f}\) and \(\overline{x}_{g}^{k} = (1/\varLambda _{k})\sum _{i=0}^{k}\lambda _{i}x_{g}^{i} \in C_{g}\), because C f and C g are convex. Thus, we can state the following proposition:

Proposition 13.

After k iterations, the relaxed PRS algorithm produces a point in each set such that the distance between the two points is of order \(O(1/\varLambda _{k})\).

4.2 Parallelized Model Fitting and Classification

The following general scenario appears in [10, Chapter 8]. Consider the following general convex model fitting problem: Let \(M: \mathbf{R}^{n} \rightarrow \mathbf{R}^{m}\) be a feature matrix, let \(b \in \mathbf{R}^{m}\) be the output vector, let \(l: \mathbf{R}^{m} \rightarrow (-\infty,\infty ]\) be a loss function, and let \(r: \mathbf{R}^{n} \rightarrow (-\infty,\infty ]\) be a regularization function. The model fitting problem is formulated as the following minimization:

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{x\in \mathbf{R}^{n}}\;l(Mx - b) + r(x).& &{}\end{array}$$
(4.41)

The function l is used to enforce the constraint Mx = b +ν up to some noise ν in the measurement, while r enforces the regularity of x by incorporating prior knowledge of the form of the solution. The function r can also be used to enforce the uniqueness of the solution of Mx = b in ill-posed problems.

We can solve Equation (4.41) by a direct application of relaxed PRS and obtain \(O(1/\varLambda _{k})\) ergodic convergence and \(o\left (1/\sqrt{k + 1}\right )\) nonergodic convergence rates. Note that these rates do not require differentiability of f or g. In contrast, the FBS algorithm requires differentiability of one of the objective functions and knowledge of the Lipschitz constant of its gradient. The advantage of FBS is the o(1∕(k + 1)) convergence rate shown in Theorem 12. However, we do not necessarily assume that l is differentiable, so we may need to compute \(\mathbf{prox}_{\gamma l(M(\cdot )-b)}\), which can be significantly more difficult than computing \(\mathbf{prox}_{\gamma l}\). Thus, in this section we separate M from l by rephrasing Equation (4.41) in the form of Problem (4.2).

In this section, we present several different ways to split Equation (4.41). Each splitting gives rise to a different algorithm and can be applied to general convex functions l and r. Our results predict convergence rates that hold for primal objectives, dual objectives, and the primal feasibility. Note that in parallelized model fitting, it is not always desirable to take the time average of all of the iterates. Indeed, when r enforces sparsity, averaging the current r-iterate with old iterates, all of which are sparse, can produce a non-sparse iterate. This will slow down vector additions and prolong convergence.

4.2.1 Auxiliary Variable

We can split Equation (4.41) by defining an auxiliary variable for \(My - b\):

$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{x\in \mathbf{R}^{m},y\in \mathbf{R}^{n}}\;l\left (x\right ) + r(y) \\ & & \text{subject to }\;My - x = b. {}\end{array}$$
(4.42)

The constraint in Equation (4.42) reduces to Ax + By = b where B = M and \(A = -I_{\mathbf{R}^{m}}\). If we set f = l and g = r and apply ADMM, the analysis of Section 3.3 shows that

$$\displaystyle\begin{array}{rcl} \vert l(x^{k}) + r(y^{k}) - l(My^{{\ast}}- b) - r(y^{{\ast}})\vert & =& o\left ( \frac{1} {\sqrt{k + 1}}\right ) {}\\ \|My^{k} - b - x^{k}\|^{2}& =& o\left ( \frac{1} {k + 1}\right ). {}\\ \end{array}$$

In particular, if l is Lipschitz, \(\vert l(x^{k}) - l(My^{k} - b)\vert = o\left (1/\sqrt{k + 1}\right )\). Thus, we have

$$\displaystyle\begin{array}{rcl} \vert l(My^{k} - b) + r(y^{k}) - l(My^{{\ast}}- b) - r(y^{{\ast}})\vert = o\left ( \frac{1} {\sqrt{k + 1}}\right ).& & {}\\ \end{array}$$

A similar analysis shows that

$$\displaystyle\begin{array}{rcl} \vert l(M\overline{y}^{k} - b) + r(\overline{y}^{k}) - l(My^{{\ast}}- b) - r(y^{{\ast}})\vert & =& O\left (\frac{1} {\varLambda _{k}}\right ) {}\\ \|M\overline{y}^{k} - b -\overline{x}^{k}\|^{2}& =& O\left ( \frac{1} {\varLambda _{k}^{2}}\right ). {}\\ \end{array}$$

In the next two splittings, we leave the derivation of convergence rates to the reader.

4.2.2 Splitting Across Examples

We assume that l is block separable: we have \(l(Mx - b) =\sum _{i=1}^{R}l_{i}(M_{i}x - b_{i})\) where

$$\displaystyle\begin{array}{rcl} M = \left [\begin{array}{*{10}c} M_{1}\\ \vdots \\ M_{R} \end{array} \right ]\quad \text{and}\quad b = \left [\begin{array}{*{10}c} b_{1}\\ \vdots \\ b_{R} \end{array} \right ].& & {}\\ \end{array}$$

Each \(M_{i} \in \mathbf{R}^{m_{i}\times n}\) is a submatrix of M, each \(b_{i} \in \mathbf{R}^{m_{i}}\) is a subvector of b, and \(\sum _{i=1}^{R}m_{i} = m\). Therefore, an equivalent form of Equation (4.41) is given by

$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{x_{1},\cdots \,,x_{R},y\in \mathbf{R}^{n}}\;\sum _{i=1}^{R}l_{ i}(M_{i}x_{i} - b_{i}) + r(y) \\ & & \text{subject to }\;x_{r} - y = 0,\quad r = 1,\cdots \,,R. {}\end{array}$$
(4.43)

We say that Equation (4.43) is split across examples. Thus, to apply ADMM to this problem, we simply stack the vectors \(x_{i}\), i = 1,⋯ ,R into a vector \(x = (x_{1},\cdots \,,x_{R})^{T} \in \mathbf{R}^{nR}\). Then the constraints in Equation (4.43) reduce to Ax + By = 0 where \(A = I_{\mathbf{R}^{nR}}\) and \(By = (-y,\cdots \,,-y)^{T}\).

4.2.3 Splitting Across Features

We can also split Equation (4.41) across features whenever r is block separable in x, in the sense that there exists an integer C > 0 such that \(r =\sum _{i=1}^{C}r_{i}(x_{i})\) and \(x_{i} \in \mathbf{R}^{n_{i}}\), where \(\sum _{i=1}^{C}n_{i} = n\). This splitting corresponds to partitioning the columns of M, i.e., \(M = \left [\begin{array}{*{10}c} M_{1},\cdots \,,M_{C} \end{array} \right ]\), and \(M_{i} \in \mathbf{R}^{m\times n_{i}}\), for all i = 1,⋯ ,C. For all \(y \in \mathbf{R}^{n}\), \(My =\sum _{i=1}^{C}M_{i}y_{i}\). With this notation, we can derive an equivalent form of Equation (4.41) given by

$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{x,y\in \mathbf{R}^{n}}\;l\left (\sum _{i=1}^{C}x_{ i} - b\right ) +\sum _{ i=1}^{C}r_{ i}(y_{i}) \\ & & \text{subject to }\;x_{i} - M_{i}y_{i} = 0,\quad i = 1,\cdots \,,C. {}\end{array}$$
(4.44)

The constraint in Equation (4.44) reduces to Ax + By = 0 where \(A = I_{\mathbf{R}^{mC}}\) and \(By = -(M_{1}y_{1},\cdots \,,M_{C}y_{C})^{T} \in \mathbf{R}^{mC}\).
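
To make the three reformulations concrete, the following Python sketch (ours, with made-up block sizes) assembles constraint data (A, B, b) of the form Ax + By = b for the splittings (4.42), (4.43), and (4.44).

    import numpy as np

    m, n, R, C = 6, 8, 2, 2
    rng = np.random.default_rng(0)
    M, b = rng.standard_normal((m, n)), rng.standard_normal(m)

    # (4.42) auxiliary variable:  My - x = b
    A42, B42, b42 = -np.eye(m), M, b

    # (4.43) split across examples, x = (x_1, ..., x_R) in R^{nR}:  x_r - y = 0
    A43, B43, b43 = np.eye(R * n), -np.vstack([np.eye(n)] * R), np.zeros(R * n)

    # (4.44) split across features, x = (x_1, ..., x_C) in R^{mC}:  x_i - M_i y_i = 0
    M1, M2 = M[:, : n // 2], M[:, n // 2:]
    A44 = np.eye(C * m)
    B44 = -np.block([[M1, np.zeros_like(M2)], [np.zeros_like(M1), M2]])
    b44 = np.zeros(C * m)

    # sanity check for (4.42): a feasible pair (x, y) satisfies A42 x + B42 y = b42
    y = rng.standard_normal(n)
    x = M @ y - b
    print(np.allclose(A42 @ x + B42 @ y, b42))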

4.3 Distributed ADMM

In this section our goal is to use Algorithm 2 for

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{x\in \mathcal{H}\ }\sum _{i=1}^{m}f_{ i}(x)& & {}\\ \end{array}$$

by using the splitting in [49]. Note that we could minimize this function by reformulating it in the product space \(\mathcal{H}\ ^{m}\) as follows:

$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{\mathbf{x}\in \mathcal{H}\,^{m}}\sum _{i=1}^{m}f_{ i}(x_{i}) +\iota _{D}(\mathbf{x}),& & {}\\ \end{array}$$

where \(D =\{ (x,\cdots \,,x) \in \mathcal{H}\ ^{m}\mid x \in \mathcal{H}\ \}\) is the diagonal set. Applying relaxed PRS to this problem results in a parallel algorithm where each function performs a local minimization step and then communicates its local variable to a central processor. In this section, we assign each function a local variable but we never communicate it to a central processor. Instead, each function only communicates with neighbors.

Formally, we assume there is a simple, connected, undirected graph G = (V, E) on | V |  = m vertices with edges E that describe a connection among the different functions. We introduce a variable \(x_{i} \in \mathcal{H}\) for each function f i , and, hence, we set \(\mathcal{H}\ _{1} = \mathcal{H}\ ^{m}\), (see Section 8). We can encode the constraint that each node communicates with neighbors by introducing an auxiliary variable for each edge in the graph:

$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{\mathbf{x}\in \mathcal{H}\ ^{m},\mathbf{y}\in \mathcal{H}\ ^{\vert E\vert }}\;\sum _{i=1}^{m}f_{ i}(x_{i}) \\ & & \text{subject to }\;x_{i} = y_{ij},x_{j} = y_{ij},\text{ for all }(i,j) \in E.{}\end{array}$$
(4.45)

The linear constraints in Equation (4.45) can be written in the form of A x + B y = 0 for proper matrices A and B. Thus, we reformulate Equation (4.45) as

$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{\mathbf{x}\in \mathcal{H}\ ^{m},\mathbf{y}\in \mathcal{H}\ ^{\vert E\vert }}\;\sum _{i=1}^{m}f_{ i}(x_{i}) + g(\mathbf{y}) \\ & & \text{subject to }\;A\mathbf{x} + B\mathbf{y} = 0, {}\end{array}$$
(4.46)

where \(g: \mathcal{H}\ ^{\vert E\vert }\rightarrow \mathbf{R}\) is the zero map.

Because we only care about finding the value of the variable \(\mathbf{x} \in \mathcal{H}\ ^{m}\), the following simplification can be made to the sequences generated by ADMM applied to Equation (4.46) with \(\lambda _{k} = 1/2\) for all k ≥ 1 [51]: Let \(\mathcal{N}_{i}\) denote the set of neighbors of \(i \in V\) and set \(x_{i}^{0} =\alpha _{i}^{0} = 0\) for all \(i \in V\). Then for all k ≥ 0,

$$\displaystyle\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} x_{i}^{k+1} =\mathop{ \mathrm{arg\,min}}\limits _{x_{i}\in \mathcal{H}\ }f_{i}(x_{i}) + \frac{\gamma \vert \mathcal{N}_{i}\vert } {2} \|x_{i} - x_{i}^{k} - \frac{1} {\vert \mathcal{N}_{i}\vert }\sum _{j\in \mathcal{N}_{i}}x_{j}^{k} + \frac{1} {\gamma \vert \mathcal{N}_{i}\vert }\alpha _{i}^{k}\|^{2} + \frac{\gamma \vert \mathcal{N}_{i}\vert } {2} \|x_{i}\|^{2}\quad \\ \alpha _{i}^{k+1} =\alpha _{ i}^{k} +\gamma \left (\vert \mathcal{N}_{i}\vert x_{i}^{k+1} -\sum _{j\in \mathcal{N}_{i}}x_{j}^{k+1}\right ). \quad \end{array} \right.& & {}\\ \end{array}$$

The above iteration is truly distributed because each node i ∈ V only requires information from its local neighbors at each iteration.
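
For quadratic local objectives the \(x_{i}\)-subproblem has a closed form, which makes the iteration easy to try out; the Python sketch below uses a made-up path graph and \(f_{i}(x) = (1/2)(x - c_{i})^{2}\), and its iterates approach the consensus minimizer, the average of the \(c_{i}\).

    import numpy as np

    neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # path graph on 4 nodes
    c = np.array([1.0, 2.0, 3.0, 6.0])                   # consensus minimizer is c.mean() = 3.0
    gamma = 1.0

    x = np.zeros(4)                                      # x_i^0 = 0
    alpha = np.zeros(4)                                  # alpha_i^0 = 0

    for _ in range(300):
        x_old = x.copy()
        for i, Ni in neighbors.items():
            deg, s = len(Ni), sum(x_old[j] for j in Ni)  # only neighbors' previous iterates
            # closed-form x_i-update for f_i(x) = 0.5*(x - c_i)^2
            x[i] = (c[i] + gamma * deg * x_old[i] + gamma * s - alpha[i]) / (1 + 2 * gamma * deg)
        for i, Ni in neighbors.items():
            alpha[i] += gamma * (len(Ni) * x[i] - sum(x[j] for j in Ni))

    print(x)   # every entry is close to the consensus value c.mean()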

In [51], linear convergence is shown for this algorithm provided that f i are strongly convex and ∇f i are Lipschitz. For general convex functions, we can deduce the nonergodic rates from Theorem 10

$$\displaystyle\begin{array}{rcl} \left \vert \sum _{i=1}^{m}f_{ i}(x_{i}^{k}) - f(x^{{\ast}})\right \vert & =& o\left ( \frac{1} {\sqrt{k + 1}}\right ) {}\\ \sum _{\begin{array}{c}i\in V \\ j\in N_{i}\end{array}}\|x_{i}^{k} - z_{ ij}^{k}\|^{2} +\sum _{ \begin{array}{c}i\in V\\ i\in N_{j}\end{array}}\|x_{j}^{k} - z_{ ij}^{k}\|^{2}& =& o\left ( \frac{1} {k + 1}\right ), {}\\ \end{array}$$

and the ergodic rates from Theorem 9

$$\displaystyle\begin{array}{rcl} \left \vert \sum _{i=1}^{m}f_{ i}(\overline{x}_{i}^{k}) - f(x^{{\ast}})\right \vert & =& O\left ( \frac{1} {k + 1}\right ) {}\\ \sum _{\begin{array}{c}i\in V \\ j\in N_{i}\end{array}}\|\overline{x}_{i}^{k} -\overline{z}_{ ij}^{k}\|^{2} +\sum _{ \begin{array}{c}i\in V\\ i\in N_{j}\end{array}}\|\overline{x}_{j}^{k} -\overline{z}_{ ij}^{k}\|^{2}& =& O\left ( \frac{1} {(k + 1)^{2}}\right ). {}\\ \end{array}$$

These convergence rates are new and complement the linear convergence results in [51]. In addition, they complement the similar ergodic rate derived in [54] for a different distributed splitting.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Davis, D., Yin, W. (2016). Convergence Rate Analysis of Several Splitting Schemes. In: Glowinski, R., Osher, S., Yin, W. (eds) Splitting Methods in Communication, Imaging, Science, and Engineering. Scientific Computation. Springer, Cham. https://doi.org/10.1007/978-3-319-41589-5_4

Download citation