Abstract
Employing the ideas of non-linear preconditioning and testing of the classical proximal point method, we reduce common arguments in convergence and convergence rate proofs of optimisation methods to the verification of a simple iteration-wise inequality. When applied to fixed point operators, the latter can be seen as a generalisation of firm non-expansivity or the \(\alpha \)-averaged property. The main purpose of this work is to provide the abstract background theory for our companion paper “Block-proximal methods with spatially adapted acceleration”. In the present account we demonstrate the effectiveness of the general approach on several classical algorithms, as well as their stochastic variants. Besides, of course, the proximal point method, these methods include gradient descent, forward–backward splitting, Douglas–Rachford splitting, Newton’s method, as well as several methods for saddle-point problems, such as the Alternating Directions Method of Multipliers and the Chambolle–Pock method.
References
Martinet, B.: Brève communication. Régularisation d’inéquations variationnelles par approximations successives. ESAIM 4(R3), 154–158 (1970)
Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976). https://doi.org/10.1137/0314056
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542
Loris, I., Verhoeven, C.: On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Probl. 27(12), 125007 (2011). https://doi.org/10.1088/0266-5611/27/12/125007
Gabay, D.: Applications of the method of multipliers to variational inequalities. In: M. Fortin, R. Glowinski (eds.) Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, vol. 15, pp. 299–331. North-Holland (1983)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40, 120–145 (2011). https://doi.org/10.1007/s10851-010-0251-1
He, B., Yuan, X.: Convergence analysis of primal-dual algorithms for a saddle-point problem: from contraction perspective. SIAM J. Imaging Sci. 5(1), 119–149 (2012). https://doi.org/10.1137/100814494
Valkonen, T., Pock, T.: Acceleration of the PDHGM on partially strongly convex functions. J. Math. Imaging Vis. 59, 394–414 (2017). https://doi.org/10.1007/s10851-016-0692-2
Valkonen, T.: Block-proximal methods with spatially adapted acceleration (2017). http://tuomov.iki.fi/m/blockcp.pdf (Submitted)
Browder, F.E.: Nonexpansive nonlinear operators in a Banach space. Proc. Natl. Acad. Sci. USA 54(4), 1041 (1965)
Wright, S.: Coordinate descent algorithms. Math. Progr. 151(1), 3–34 (2015). https://doi.org/10.1007/s10107-015-0892-3
Censor, Y., Zenios, S.A.: Proximal minimization algorithm with D-functions. J. Optim. Theory Appl. 73(3), 451–464 (1992). https://doi.org/10.1007/BF00940051
Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993). https://doi.org/10.1137/0803026
Lorenz, D., Pock, T.: An inertial forward-backward algorithm for monotone inclusions. J. Math. Imaging Vis. 51(2), 311–325 (2015). https://doi.org/10.1007/s10851-014-0523-2
Hohage, T., Homann, C.: A generalization of the Chambolle-Pock algorithm to Banach spaces with applications to inverse problems (2014) (Preprint)
Hua, X., Yamashita, N.: Block coordinate proximal gradient methods with variable Bregman functions for nonsmooth separable optimization. Math. Program. 160(1), 1–32 (2016). https://doi.org/10.1007/s10107-015-0969-z
Brezis, H., Crandall, M.G., Pazy, A.: Perturbations of nonlinear maximal monotone sets in Banach space. Commun. Pure Appl. Math. 23(1), 123–144 (1970). https://doi.org/10.1002/cpa.3160230107
Opial, Z.: Weak convergence of the sequence of successive approximations for nonexpansive mappings. Bull. Am. Math. Soc. 73(4), 591–597 (1967). https://doi.org/10.1090/S0002-9904-1967-11761-0
Browder, F.E.: Convergence theorems for sequences of nonlinear operators in Banach spaces. Math. Z. 100(3), 201–225 (1967). https://doi.org/10.1007/BF01109805
Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004). https://doi.org/10.1002/cpa.20042
Douglas, J., Jr., Rachford, H.H., Jr.: On the numerical solution of heat conduction problems in two and three space variables. Trans. Am. Math. Soc. 82(2), 421–439 (1956). https://doi.org/10.2307/1993056
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2 edn. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York (2017). https://doi.org/10.1007/978-3-319-48311-5
Lee, J.D., Sun, Y., Saunders, M.A.: Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim. 24(3), 1420–1443 (2014). https://doi.org/10.1137/130921428
Mann, W.R.: Mean value methods in iteration. Proc. Am. Math. Soc. 4(3), 506–510 (1953)
Schaefer, H.: Über die Methode sukzessiver Approximationen. Jahresbericht der Deutschen Mathematiker-Vereinigung 59, 131–140 (1957)
Petryshyn, W.: Construction of fixed points of demicompact mappings in Hilbert space. J. Math. Anal. Appl. 14(2), 276–284 (1966)
Krasnoselski, M.A.: Two remarks about the method of successive approximations. Uspekhi Mat. Nauk. 19, 123–127 (1955)
Shiryaev, A.N.: Probability. Graduate Texts in Mathematics. Springer, New York (1996)
Qu, Z., Richtárik, P., Takáč, M., Fercoq, O.: SDNA: stochastic dual Newton ascent for empirical risk minimization (2015)
Pilanci, M., Wainwright, M.J.: Iterative Hessian sketch: fast and accurate solution approximation for constrained least-squares. J. Mach. Learn. Res. 17(53), 1–38 (2016)
Condat, L.: A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl. 158(2), 460–479 (2013). https://doi.org/10.1007/s10957-012-0245-9
Vũ, B.C.: A splitting algorithm for dual monotone inclusions involving cocoercive operators. Adv. Comput. Math. 38(3), 667–681 (2013). https://doi.org/10.1007/s10444-011-9254-8
Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal–dual algorithm. Math. Progr. (2015). https://doi.org/10.1007/s10107-015-0957-3
Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Progr. (2015). https://doi.org/10.1007/s10107-015-0901-6
Appendices
Appendix A: Outer Semicontinuity of Maximal Monotone Operators
We could not find the following result explicitly stated in the literature, although it is hidden in, e.g., the proof of [2, Theorem 1].
Lemma A.1
Let \(H: U\rightrightarrows U\) be maximal monotone on a Hilbert space \(U\). Then H is weak-to-strong outer semicontinuous: for any sequence \(\{u^i\}_{i \in \mathbb {N}}\), and any \(z^i \in H(u^i)\) such that \(u^i\mathrel {\rightharpoonup }u\) weakly, and \(z^i \rightarrow z\) strongly, we have \(z \in H(u)\).
Proof
By monotonicity, for any \(u' \in U\) and \(z' \in H(u')\) we have \(D_i :=\langle u'-u^i,z'-z^i\rangle \ge 0\). Since a weakly convergent sequence is bounded, we have \(D_i \le \langle u'-u^i,z'-z\rangle +C\Vert z-z^i\Vert \) for some \(C>0\) independent of i. Taking the limit, we therefore have \(\langle u'-u,z'-z\rangle \ge 0\). If we had \(z \not \in H(u)\), this would contradict that H is maximal, i.e., that its graph is not properly contained in the graph of any other monotone operator. \(\square \)
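The bound used in the proof comes from splitting \(z'-z^i = (z'-z) + (z-z^i)\) and applying the Cauchy–Schwarz inequality. As a sketch, with the constant chosen as \(C := \sup _i \Vert u'-u^i\Vert \) (one admissible choice, assumed here for illustration; it is finite because weakly convergent sequences are bounded, and it may depend on \(u'\) but not on i):

$$\begin{aligned} 0 \le D_i = \langle u'-u^i,z'-z\rangle + \langle u'-u^i,z-z^i\rangle \le \langle u'-u^i,z'-z\rangle + C\Vert z-z^i\Vert . \end{aligned}$$

Since \(u^i\mathrel {\rightharpoonup }u\) weakly, the first term on the right converges to \(\langle u'-u,z'-z\rangle \), while the second vanishes because \(z^i \rightarrow z\) strongly.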
Appendix B: Three-Point Inequalities
The following three-point formulas are central to handling forward steps with respect to smooth functions.
Lemma B.1
Suppose \(J \in \mathrm {cpl}(X)\) has L-Lipschitz gradient. Then
as well as
Proof
Regarding the “three-point hypomonotonicity” (74), the L-Lipschitz gradient implies co-coercivity (see [22] or Appendix C)
Thus using Cauchy’s inequality
To prove (75), the Lipschitz gradient implies the smoothness or “descent inequality” (again, [22] or Appendix C)
By convexity \(J({\widehat{x}})-J(z) \ge \langle \nabla J(z),{\widehat{x}}-z\rangle \). Summed, we obtain (75). \(\square \)
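The two estimates can be recovered from the ingredients named in the proof. The following is a sketch under the assumption that (74) and (75) take the usual three-point form for arbitrary \(x, z, {\widehat{x}} \in X\); the constants are those produced by this derivation and are not quoted from the displayed formulas. Co-coercivity, the Cauchy–Schwarz inequality, and Young's inequality \(ab \le L^{-1}a^2 + \frac{L}{4}b^2\) give

$$\begin{aligned} \langle \nabla J(z)-\nabla J({\widehat{x}}),x-{\widehat{x}}\rangle&= \langle \nabla J(z)-\nabla J({\widehat{x}}),z-{\widehat{x}}\rangle + \langle \nabla J(z)-\nabla J({\widehat{x}}),x-z\rangle \\&\ge L^{-1}\Vert \nabla J(z)-\nabla J({\widehat{x}})\Vert ^2 - \Vert \nabla J(z)-\nabla J({\widehat{x}})\Vert \Vert x-z\Vert \\&\ge -\frac{L}{4}\Vert x-z\Vert ^2. \end{aligned}$$

Likewise, subtracting the convexity inequality \(J({\widehat{x}})-J(z) \ge \langle \nabla J(z),{\widehat{x}}-z\rangle \) from the descent inequality \(J(x)-J(z) \le \langle \nabla J(z),x-z\rangle + \frac{L}{2}\Vert x-z\Vert ^2\) gives

$$\begin{aligned} \langle \nabla J(z),x-{\widehat{x}}\rangle \ge J(x)-J({\widehat{x}}) - \frac{L}{2}\Vert x-z\Vert ^2. \end{aligned}$$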
Lemma B.2
Suppose \(J \in \mathrm {cpl}(X)\) has L-Lipschitz gradient and is \(\gamma \)-strongly convex. Then for any \(\tau >0\) we have
as well as
Proof
To prove (78), using strong convexity, the Lipschitz gradient, and Cauchy’s inequality, we have
Regarding (77), using the \(\gamma \)-strong monotonicity of \(\nabla J\), we estimate completely analogously
\(\square \)
Since smooth functions with a positive Hessian are locally convex, the above lemmas readily extend to this case, locally. In fact, we have the following more precise result:
Lemma B.3
Suppose \(J \in C^2(X)\) with \(\nabla ^2 J({\widehat{x}}) > 0\) at given \({\widehat{x}}\in X\). Then for any \(\tau \in (0, 2]\) and all \(z, x, \eta \in X\), we have
with
If \(x \in {{\mathrm{cl}}}B(\Vert z-{\widehat{x}}\Vert , {\widehat{x}})\), then also
Proof
By Taylor expansion, for some \(\zeta \) between z and \({\widehat{x}}\), and any \(\tau >0\), we have
Since \(\zeta \in {{\mathrm{cl}}}B(\Vert z-{\widehat{x}}\Vert , {\widehat{x}})\), by the definition of \(\delta _{z,\eta }\), we obtain (79).
Similarly, by Taylor expansion, for some \(\zeta _0\) between x and \({\widehat{x}}\), we have
Using (82) we obtain
Using the assumption \(x \in {{\mathrm{cl}}}B(\Vert z-{\widehat{x}}\Vert , {\widehat{x}})\), we have \(\zeta _0 \in {{\mathrm{cl}}}B(\Vert z-{\widehat{x}}\Vert , {\widehat{x}})\). Hence we obtain (81) by the definition of \(\delta _{z,\eta }\) and \((1-\delta _{z,\eta })(2-\tau )-(1+\delta _{z,\eta })=(1-\delta _{z,\eta })(1-\tau )-2\delta _{z,\eta }\). \(\square \)
We can also derive the following alternate result:
Lemma B.4
Suppose \(J \in C^2(X)\) with \(\nabla ^2 J({\widehat{x}}) > 0\) at given \({\widehat{x}}\in X\). Then for all \(z, x, \eta \in X\) we have
for \(\delta _{z,\eta }\) given by (80). If \(x \in {{\mathrm{cl}}}B(\Vert z-{\widehat{x}}\Vert , {\widehat{x}})\), then also
Proof
By Taylor expansion, for some \(\zeta \) between z and \({\widehat{x}}\), we have
In the last step we have used Cauchy’s inequality, and the definition of \(\delta _{z,\eta }\) following \(\zeta \in {{\mathrm{cl}}}B(\Vert z-{\widehat{x}}\Vert , {\widehat{x}})\). The standard three-point or Pythagoras’ identity states
Applying this in (86), we obtain (84).
To prove (85), we use (83), the definition of \(\delta _{z,\eta }\), and (84). \(\square \)
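The three-point identity invoked in the proof is easy to sanity-check numerically. The sketch below assumes the common form \(\langle x-y,x-z\rangle = \frac{1}{2}\Vert x-y\Vert ^2 - \frac{1}{2}\Vert y-z\Vert ^2 + \frac{1}{2}\Vert x-z\Vert ^2\) in a Euclidean space; the function names and the random test points are illustrative, and the variables need not match those of the displayed formula.

```python
import random


def inner(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def sub(a, b):
    """Componentwise difference a - b."""
    return [ai - bi for ai, bi in zip(a, b)]


def three_point_gap(x, y, z):
    """Left-hand side minus right-hand side of the assumed identity."""
    xy, yz, xz = sub(x, y), sub(y, z), sub(x, z)
    lhs = inner(xy, xz)
    rhs = 0.5 * inner(xy, xy) - 0.5 * inner(yz, yz) + 0.5 * inner(xz, xz)
    return lhs - rhs


def max_abs_gap(trials=1000, dim=5, seed=0):
    """Largest absolute gap over random triples; expected at rounding level."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x, y, z = ([rng.uniform(-10.0, 10.0) for _ in range(dim)]
                   for _ in range(3))
        worst = max(worst, abs(three_point_gap(x, y, z)))
    return worst
```

The identity follows from expanding \(\Vert y-z\Vert ^2 = \Vert (x-z)-(x-y)\Vert ^2\), so the gap should vanish up to floating-point rounding for any choice of points.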
Appendix C: Projected Gradients and Smoothness
The next lemma generalises well-known properties (see, e.g., [22]) of smooth convex functions to projected gradients, when we take P to be a projection operator. With P a random projection, taking the expectation in (89), we obtain in particular a connection to the Expected Separable Over-approximation property in the stochastic coordinate descent literature [34].
Lemma C.1
Let \(J \in \mathrm {cpl}(X)\), and \(P \in \mathcal {L}(X; X)\) be self-adjoint and positive semi-definite on a Hilbert space X. Suppose P has a pseudo-inverse \(P^\dag \) satisfying \( P P^\dag P = P\). Consider the properties:
(i) P-relative Lipschitz continuity of \(\nabla J\) with factor L:
$$\begin{aligned} \Vert \nabla J(x)-\nabla J(y)\Vert _P \le L \Vert x-y\Vert _{P^\dag } \quad (x, y \in X). \end{aligned}\tag{87}$$
(ii) The P-relative property
$$\begin{aligned} \langle \nabla J(x+Ph) - \nabla J(x),Ph\rangle \le L\Vert h\Vert _P^2 \quad (x, h \in X). \end{aligned}\tag{88}$$
(iii) P-relative smoothness of J with factor L:
$$\begin{aligned} J(x+Ph) \le J(x) + \langle \nabla J(x),Ph\rangle +\frac{L}{2}\Vert h\Vert _P^2 \quad (x, h \in X). \end{aligned}\tag{89}$$
(iv) The P-relative property
$$\begin{aligned} J(y) \le J(x) + \langle \nabla J(y),y-x\rangle -\frac{1}{2L}\Vert \nabla J(x)-\nabla J(y)\Vert _P^2 \quad (x, y \in X). \end{aligned}\tag{90}$$
(v) P-relative co-coercivity of \(\nabla J\) with factor \(L^{-1}\):
$$\begin{aligned} L^{-1} \Vert \nabla J(x)-\nabla J(y)\Vert _P^2 \le \langle \nabla J(x)-\nabla J(y),x-y\rangle \quad (x, y \in X). \end{aligned}\tag{91}$$
We have (i) \(\implies \) (ii) \(\iff \) (iii) \(\implies \) (iv) \(\implies \) (v). If P is invertible, all are equivalent.
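A small numerical sketch illustrates why the chain of implications cannot, in general, be closed when P is not invertible. It assumes the hypothetical example \(J(x)=\frac{1}{2}\langle Ax,x\rangle \) on \(\mathbb {R}^2\) with a symmetric positive definite matrix A, the coordinate projection \(P=\mathrm {diag}(1,0)\) (so that \(P^\dag =P\)), and the factor \(L=A_{11}\); all names are illustrative, and the randomised check is not a proof.

```python
import math
import random

A = [[2.0, 1.0], [1.0, 3.0]]  # symmetric positive definite (assumed example)
L = A[0][0]                   # candidate factor: the (1,1) entry of A


def inner(a, b):
    return a[0] * b[0] + a[1] * b[1]


def grad(x):
    """Gradient of J(x) = 0.5 * <Ax, x>, namely Ax."""
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]


def J(x):
    return 0.5 * inner(grad(x), x)


def proj(v):
    """P = diag(1, 0): keep the first coordinate only."""
    return [v[0], 0.0]


def sqnorm_P(v):
    """Squared seminorm ||v||_P^2 = <Pv, v>."""
    return inner(proj(v), v)


def smoothness_gap(x, h):
    """Right-hand side minus left-hand side of (89)."""
    Ph = proj(h)
    shifted = [x[0] + Ph[0], x[1] + Ph[1]]
    return J(x) + inner(grad(x), Ph) + 0.5 * L * sqnorm_P(h) - J(shifted)


def cocoercivity_gap(x, y):
    """Right-hand side minus left-hand side of (91)."""
    g = [a - b for a, b in zip(grad(x), grad(y))]
    d = [a - b for a, b in zip(x, y)]
    return inner(g, d) - sqnorm_P(g) / L


def lipschitz_gap(x, y):
    """Right-hand side minus left-hand side of (87), using P^dagger = P."""
    g = [a - b for a, b in zip(grad(x), grad(y))]
    d = [a - b for a, b in zip(x, y)]
    return L * math.sqrt(sqnorm_P(d)) - math.sqrt(sqnorm_P(g))


def min_gaps(trials=1000, seed=0):
    """Smallest observed gaps for (89) and (91) over random points."""
    rng = random.Random(seed)
    pt = lambda: [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
    s = min(smoothness_gap(pt(), pt()) for _ in range(trials))
    c = min(cocoercivity_gap(pt(), pt()) for _ in range(trials))
    return s, c
```

In this example the gaps for (89) and (91) stay non-negative up to rounding, whereas `lipschitz_gap([0.0, 1.0], [0.0, 0.0])` is negative: the points differ only in the coordinate that P discards, yet the off-diagonal entry of A makes the projected gradients differ, so (87) fails for every L.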
Proof
(i) \(\implies \) (ii): Take \(y=x+Ph\) and multiply (87) by \(\Vert h\Vert _P\). Then use Cauchy–Schwarz.
(ii) \(\implies \) (iii): Using the mean value theorem and (88), we compute (89):
(iii) \(\implies \) (ii): Add together (89) for \(x=x'\) and \(x=x'+Ph\).
(iii) \(\implies \) (iv): Adding \(-\langle \nabla J(y),x+Ph\rangle \) on both sides of (89), we get
The left-hand side is minimised with respect to x by taking \(x=y-Ph\). Taking on the right-hand side \(h=L^{-1}(\nabla J(y)-\nabla J(x))\) therefore gives (90).
(iv) \(\implies \) (v): Summing the estimate (90) with the same estimate with x and y exchanged, we obtain (91).
(v) \(\implies \) (i) when P is invertible: Cauchy–Schwarz. \(\square \)
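The computations behind the two middle steps of the proof can be sketched as follows; this is a reconstruction from (88) and (89), not a quotation of the displayed formulas. For (ii) \(\implies \) (iii), applying (88) with th in place of h and dividing by \(t \in (0,1]\) gives \(\langle \nabla J(x+tPh)-\nabla J(x),Ph\rangle \le tL\Vert h\Vert _P^2\), so that

$$\begin{aligned} J(x+Ph)-J(x)-\langle \nabla J(x),Ph\rangle = \int _0^1 \langle \nabla J(x+tPh)-\nabla J(x),Ph\rangle \,dt \le \int _0^1 tL\Vert h\Vert _P^2 \,dt = \frac{L}{2}\Vert h\Vert _P^2. \end{aligned}$$

For (iii) \(\implies \) (iv), adding \(-\langle \nabla J(y),x+Ph\rangle \) to both sides of (89) yields

$$\begin{aligned} J(x+Ph)-\langle \nabla J(y),x+Ph\rangle \le J(x)-\langle \nabla J(y),x\rangle +\langle \nabla J(x)-\nabla J(y),Ph\rangle +\frac{L}{2}\Vert h\Vert _P^2. \end{aligned}$$

By convexity, \(x' \mapsto J(x')-\langle \nabla J(y),x'\rangle \) attains its minimum at \(x'=y\), so the left-hand side is bounded from below by \(J(y)-\langle \nabla J(y),y\rangle \). With \(h=L^{-1}(\nabla J(y)-\nabla J(x))\), the last two terms on the right combine to \(-\frac{1}{2L}\Vert \nabla J(x)-\nabla J(y)\Vert _P^2\), and rearranging gives (90).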
Valkonen, T. Testing and Non-linear Preconditioning of the Proximal Point Method. Appl Math Optim 82, 591–636 (2020). https://doi.org/10.1007/s00245-018-9541-6
DOI: https://doi.org/10.1007/s00245-018-9541-6