
Fast and safe: accelerated gradient methods with optimality certificates and underestimate sequences

Abstract

In this work we introduce the concept of an Underestimate Sequence (UES), which is motivated by Nesterov’s estimate sequence. Our definition of a UES utilizes three sequences, one of which is a lower bound (or under-estimator) of the objective function. The question of how to construct an appropriate sequence of lower bounds is addressed, and we present lower bounds for strongly convex smooth functions and for strongly convex composite functions, which adhere to the UES framework. Further, we propose several first order methods for minimizing strongly convex functions in both the smooth and composite cases. The algorithms, based on efficiently updating lower bounds on the objective functions, have natural stopping conditions that provide the user with a certificate of optimality. Convergence of all algorithms is guaranteed through the UES framework, and we show that all presented algorithms converge linearly, with the accelerated variants enjoying the optimal linear rate of convergence.


References

  1. Allen-Zhu, Z., Qu, Z., Richtárik, P., Yuan, Y.: Even faster accelerated coordinate descent using non-uniform sampling. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning, Volume 48 of Proceedings of Machine Learning Research, pp. 1110–1119, New York, USA (2016). PMLR

  2. Baes, M.: Estimate sequence methods: extensions and approximations. Technical Report Optimization-Online 2372, Université Catholique de Louvain (2009)

  3. Bubeck, S., Lee, Y.T., Singh, M.: A geometric alternative to Nesterov’s accelerated gradient descent. Technical report, Microsoft Research (2015). arXiv:1506.08187 [math.OC]

  4. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27 (2011)


  5. Chen, S., Ma, S., Liu, W.: Geometric descent method for convex composite minimization. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 636–644. Curran Associates Inc, Red Hook (2017)


  6. Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24, pp. 1647–1655. Curran Associates Inc., Red Hook (2011)


  7. Diakonikolas, J.: The approximate duality gap technique: a unified theory of first-order methods. SIAM J. Optim. 29(1), 660–689 (2019)


  8. Drusvyatskiy, D., Fazel, M., Roy, S.: An optimal first order method based on optimal quadratic averaging. SIAM J. Optim. 28(1), 251–271 (2018)


  9. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015). https://doi.org/10.1137/130949993


  10. Fountoulakis, K., Tappenden, R.: A flexible coordinate descent method. Comput. Optim. Appl. 70(2), 351–394 (2018)


  11. Ghadimi, S.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012). https://doi.org/10.1137/110848876


  12. Jaggi, M., Smith, V., Takac, M., Terhorst, J., Krishnan, S., Hofmann, T., Jordan, M.I.: Communication-efficient distributed dual coordinate ascent. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 3068–3076. Curran Associates Inc, Red Hook (2014)


  13. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 315–323. Curran Associates Inc., Red Hook (2013)


  14. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR), 7–9 May 2015, San Diego, USA (2015). arXiv:1412.6980

  15. Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 3384–3392. Curran Associates Inc, Red Hook (2015)


  16. Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Program. 152, 615–642 (2015). https://doi.org/10.1007/s10107-014-0800-2


  17. Ma, C., Jaggi, M., Curtis, F.E., Srebro, N., Takáč, M.: An accelerated communication-efficient primal-dual optimization framework for structured machine learning. Technical Report, Lehigh University, USA (2017). arXiv:1711.05305 [math.OC]

  18. Ma, C., Smith, V., Jaggi, M., Jordan, M., Richtárik, P., Takáč, M.: Adding vs. averaging in distributed primal-dual optimization. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning, Volume 37 of Proceedings of Machine Learning Research, pp. 1973–1982, Lille, France (2015). PMLR

  19. Nesterov, Yu.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)


  20. Nesterov, Y.: A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)


  21. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Springer (Originally published by Kluwer Academic Publishers), Berlin (2004). https://doi.org/10.1007/978-1-4419-8853-9

  22. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005). https://doi.org/10.1007/s10107-004-0552-5


  23. Nesterov, Y.: Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Université Catholique de Louvain (2007)

  24. Nesterov, Y.: Accelerating the cubic regularization of Newton's method on convex problems. Math. Program. 112(1), 159–181 (2008)


  25. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013). https://doi.org/10.1007/s10107-012-0629-5


  26. Nitanda, A.: Stochastic proximal gradient descent with acceleration techniques. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 1574–1582. Curran Associates Inc, Red Hook (2014)


  27. O’Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)


  28. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)


  29. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  30. Schmidt, M., Roux, N.L., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017). https://doi.org/10.1007/s10107-016-1030-6


  31. Shalev-Shwartz, S., Zhang, T.: Accelerated mini-batch stochastic dual coordinate ascent. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 378–385. Curran Associates Inc, Red Hook (2013)

  32. Tappenden, R., Richtárik, P., Gondzio, J.: Inexact coordinate descent: complexity and preconditioning. J. Optim. Theory Appl. 170, 144–176 (2016)




Acknowledgements

This work was partially supported by the U.S. National Science Foundation, under award numbers NSF:CCF:1618717 and NSF:CCF:1740796.

Author information


Corresponding author

Correspondence to Rachael Tappenden.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: A comparison of a UES and Nesterov’s ES

Here we briefly compare the definition of an underestimate sequence with Nesterov’s definition of an estimate sequence. The original definition of an ES applies only to smooth functions satisfying Assumption 1, so we restrict our discussion to this case. Moreover, we will assume that the same sequence \(\{\alpha _k\}_{k=0}^{\infty }\) is used when discussing a UES and an ES. Consider the following definition.

Definition 2

(Definition 2.2.1 in [21]) A pair of sequences \(\{\phi _k^N(x)\}_{k=0}^{\infty }\) and \(\{\lambda _k^N\}_{k=0}^{\infty }\), with \(\lambda _k^N \ge 0\), is called an estimate sequence of the function f(x) if

  1. (i)

    \(\lambda _k^N \rightarrow 0\); and

  2. (ii)

    for any \(x \in {\mathbb {R}}^n\) and all \(k\ge 0\) we have \(\phi _k^N(x) \le (1-\lambda _k^N)f(x) + {{\lambda _k^N}} \phi _0^N(x).\)

The definition is general and does not say anything about convergence. With this in mind, Nesterov’s ES is coupled with the following lemma.

Lemma 10

(Lemma 2.2.1 in [21]) If for some sequence \(\{x_k\}\) we have

$$\begin{aligned} f(x_k) \le (\phi _k^N)^* :=\min _{x\in {\mathbb {R}}^n} \phi _k^N(x), \end{aligned}$$
(57)

then \(f(x_k) - f^* \le \lambda _k^N (\phi _0^N(x^*) - f^*)\rightarrow 0\).

The first observation to make is the clear difference between f and \(\phi _k(x)\) for a UES and an ES. In particular, for a UES it holds at every iteration that \(\phi _k(x)\le F(x)\) (4), so every \(\phi _k(x)\) (\(k\ge 0\)) is a global underestimate (global lower bound) of the objective function for all \(x\in {\mathbb {R}}^n\). For an ES, in contrast, the function \(\phi _k^N(x)\) is not necessarily a global upper bound for f, and, whenever (57) holds and \(x_k\) is not yet optimal, it cannot be a global lower bound for f. What must hold is that, at iteration k, the minimum of the approximating function \(\phi _k^N(x)\) is at least as large as \(f(x_k)\) (the value of the objective function at the current point).

A second major difference is the following. If an algorithm generates sequences that form a UES (satisfying Definition 1), then, assuming \(\sum _{k=0}^{\infty } \alpha _k = \infty \), the algorithm is guaranteed to converge (see Proposition 1). On the other hand, if an algorithm generates sequences that form an ES (satisfying Definition 2), there is no such convergence guarantee. This statement is made concrete by the next lemma and the text that follows it.

Lemma 11

(Lemma 2.2.2 in [21]) Assume that (1) f satisfies Assumption 1, (2) \(\phi _0^N(x)\) is an arbitrary function on \({\mathbb {R}}^n\), (3) \(\{y_k\}_{k=0}^{\infty }\) is an arbitrary sequence in \({\mathbb {R}}^n\), (4) \(\{\alpha _k\}_{k=0}^{\infty }\) with \(\alpha _k \in (0,1)\) and \(\sum _{k=0}^{\infty } \alpha _k = \infty \) and (5) \(\lambda _0^N = 1\). Then the pair of sequences \(\{\phi _k^N(x)\}_{k=0}^{\infty }\) and \(\{\lambda _k^N\}_{k=0}^{\infty }\) recursively defined by

$$\begin{aligned} \lambda _{k+1}^N & = (1-\alpha _k)\lambda _k^N, \end{aligned}$$
(58)
$$\begin{aligned} \phi _{k+1}^N(x) & = (1-\alpha _k)\phi _k^N(x)\nonumber \\&\quad+\, \alpha _k\Big (f(y_k) + \langle \nabla f(y_k),x-y_k \rangle + \tfrac{\mu }{2}\Vert {x-y_k}\Vert ^2\Big ) \end{aligned}$$
(59)

is an estimate sequence.
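Lemma 11 is easy to exercise numerically. The sketch below is not from the paper: the quadratic f, the arbitrary sequence \(y_k\), and the constant \(\alpha_k = 0.5\) are illustrative choices. It evaluates the recursion (58)–(59) pointwise on a set of test points and checks both parts of Definition 2, i.e. \(\phi_k^N(x) \le (1-\lambda_k^N)f(x) + \lambda_k^N \phi_0^N(x)\) with \(\lambda_k^N \rightarrow 0\):

```python
import numpy as np

# Pointwise check that the recursion (58)-(59) produces an estimate sequence
# in the sense of Definition 2, for the strongly convex quadratic
# f(x) = 0.5 x^T A x with mu = lambda_min(A).
rng = np.random.default_rng(0)
A = np.diag([1.0, 4.0, 10.0])
mu = 1.0                                # smallest eigenvalue of A
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

alpha = 0.5                             # any alpha_k in (0, 1) with divergent sum
X = rng.standard_normal((20, 3))        # test points for the pointwise check
x0 = np.array([1.0, -1.0, 2.0])
phi0 = np.array([f(x0) + 0.5 * mu * np.dot(x - x0, x - x0) for x in X])
phi, lam = phi0.copy(), 1.0             # phi_0 arbitrary, lambda_0 = 1

for k in range(30):
    y = rng.standard_normal(3)          # arbitrary sequence y_k (Lemma 11)
    lower = np.array([f(y) + grad(y) @ (x - y) + 0.5 * mu * np.dot(x - y, x - y)
                      for x in X])      # strong-convexity lower bound on f
    phi = (1 - alpha) * phi + alpha * lower   # recursion (59), pointwise
    lam = (1 - alpha) * lam                   # recursion (58)
    fX = np.array([f(x) for x in X])
    # Definition 2(ii): phi_k(x) <= (1 - lambda_k) f(x) + lambda_k phi_0(x)
    assert np.all(phi <= (1 - lam) * fX + lam * phi0 + 1e-9)

assert lam < 1e-8                       # Definition 2(i): lambda_k -> 0
```

The inequality holds by induction: the bracketed term in (59) is a global lower bound on f by \(\mu\)-strong convexity, which is exactly why the recursion preserves Definition 2(ii).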

Combining (36) with (33) and (16) shows that \(\phi _{k+1}(x)\) in (36) is equivalent to \(\phi _{k+1}^N(x)\) in (59), and therefore the construction in this work is an estimate sequence (\(\phi _{k+1}(x) \equiv \phi _{k+1}^N(x)\)). However, we will now show that the construction does not satisfy (57); therefore, even though the iterates generated by SUESA/ASUESA form an estimate sequence, Lemma 10 cannot be used to prove convergence of SUESA/ASUESA.

Lemma 12

(Lemma 2.2.3 in [21]) Let \(\phi _0^N(x) = (\phi _0^N)^* + \frac{\gamma _0^N}{2}\Vert {x-v_0^N}\Vert ^2\). Then the process described in Lemma 11 preserves the canonical form of the functions \(\{\phi _k^N(x)\}_{k=0}^{\infty }\):

$$\begin{aligned} \phi _{k+1}^N(x) = (\phi _{k+1}^N)^* + \tfrac{\gamma _{k+1}^N}{2}\Vert {x - v_{k+1}^N}\Vert ^2, \end{aligned}$$
(60)

where the sequences \(\{\gamma _k^N\}_{k=0}^{\infty }\), \(\{v_k^N\}_{k=0}^{\infty }\) and \(\{(\phi _k^N)^*\}_{k=0}^{\infty }\) are defined as follows:

$$\begin{aligned} \gamma _{k+1}^N & = (1-\alpha _k)\gamma _k^N + \alpha _k \mu , \end{aligned}$$
(61)
$$\begin{aligned} v_{k+1}^N & = \tfrac{1}{\gamma _{k+1}^N}\Big ((1-\alpha _k)\gamma _k^Nv_k^N + \alpha _k \mu y_k - \alpha _k \nabla f(y_k)\Big ), \end{aligned}$$
(62)
$$\begin{aligned} (\phi _{k+1}^N)^* & = (1-\alpha _k)(\phi _k^N)^* + \alpha _k f(y_k) - \tfrac{\alpha _k^2}{2 \gamma _{k+1}^N}\Vert {\nabla f(y_k)}\Vert ^2 \nonumber \\&\quad +\, \tfrac{\alpha _k(1-\alpha _k){{\gamma _k^N}}}{\gamma _{k+1}^N}\Big (\tfrac{\mu }{2}\Vert {y_k - v_k^N}\Vert ^2 + \langle \nabla f(y_k), v_k^N - y_k\rangle \Big ) \end{aligned}$$
(63)

Note that in (60) the coefficient of the squared-norm term is \(\gamma _{k+1}^N/2\), while in (37) it is \(\mu /2\). Substituting \(\gamma _k^N = \mu \) into (61) gives \(\gamma _{k+1}^N = \gamma _k^N = \mu \), so setting \(\gamma _0^N = \mu \) ensures that \(\gamma _k^N\) is fixed for all \(k\ge 0\) in Lemma 12. Now, using \(\gamma _{k+1}^N = \gamma _k^N = \mu \) in (62) and recalling the form of a long step shows that \(v_{k+1}^N \equiv v_{k+1}\) in (25). It remains to observe that (38) (combined with (16)) is equivalent to (63).
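The canonical-form identity of Lemma 12 can also be confirmed numerically. In the sketch below (an illustration, not the paper's code; the quadratic f and the choices of \(\gamma_0^N\), \(v_0^N\) and \(\alpha_k\) are arbitrary) we run the pointwise recursion (59) alongside the parameter updates (61)–(63) and check that (60) holds at every test point:

```python
import numpy as np

# Confirm Lemma 12: the recursion (59) keeps phi_k in the canonical form (60),
# with gamma, v and phi* updated via (61)-(63).
rng = np.random.default_rng(1)
A = np.diag([2.0, 5.0])                 # f(x) = 0.5 x^T A x, mu = 2
mu = 2.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

X = rng.standard_normal((10, 2))        # test points
gamma, v = 3.0, np.array([1.0, -1.0])   # phi_0(x) = phi0* + gamma0/2 ||x - v0||^2
phi_star = 0.0
phi = np.array([phi_star + 0.5 * gamma * np.dot(x - v, x - v) for x in X])

for k in range(20):
    a = 0.4                             # alpha_k in (0, 1)
    y = rng.standard_normal(2)          # arbitrary y_k
    g = grad(y)
    # pointwise update (59)
    phi = (1 - a) * phi + a * np.array(
        [f(y) + g @ (x - y) + 0.5 * mu * np.dot(x - y, x - y) for x in X])
    # canonical-form updates (61)-(63), using the *old* gamma and v on the right
    gamma_new = (1 - a) * gamma + a * mu
    v_new = ((1 - a) * gamma * v + a * mu * y - a * g) / gamma_new
    phi_star = ((1 - a) * phi_star + a * f(y)
                - a**2 / (2 * gamma_new) * (g @ g)
                + a * (1 - a) * gamma / gamma_new
                  * (0.5 * mu * np.dot(y - v, y - v) + g @ (v - y)))
    gamma, v = gamma_new, v_new
    canonical = np.array([phi_star + 0.5 * gamma * np.dot(x - v, x - v)
                          for x in X])
    assert np.allclose(phi, canonical, atol=1e-8)   # (60) at every test point
```

Setting `gamma = mu` at initialization makes `gamma_new == gamma == mu` on every iteration, which is the substitution used in the text above.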

Thus, the difference between the construction in this work and the construction in [21] comes down to the minimizer and minimal value of \(\phi _0^{N}(x)\). For an ES it must hold that \(f(x_0)\le (\phi _0^{N})^*\), and the initialization of scheme (2.2.6) in [21], as well as the proof of Theorem 2.2.1, explicitly uses the choice \(v_0 = x_0\) for Nesterov’s method. Note, however, that other choices of \(v_0\) can still achieve the equality \((\phi _0^{N})^* = f(x_0)\): the minimal value of the first element in the ES remains \(f(x_0)\), while the minimizer shifts away from \(x_0\). This contrasts with SUESA/ASUESA, where it is required that \(\phi _0(x) \le f(x)\), and so they are initialized with \(v_0 = x_0^{++}\) and \(\phi _0^* = f(x_0) - \tfrac{1}{2\mu }\Vert {\nabla f(x_0)}\Vert ^2 \le f(x_0)\); see (35).
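The SUESA/ASUESA initialization property can be checked directly. The sketch below assumes the long-step form \(x_0^{++} = x_0 - \nabla f(x_0)/\mu\) (an assumption about the paper's notation, consistent with the long step referenced above) and verifies that \(\phi_0(x) = \phi_0^* + \tfrac{\mu}{2}\Vert x - v_0\Vert^2\) with the initialization (35) globally under-estimates a strongly convex quadratic:

```python
import numpy as np

# Check phi_0(x) <= f(x) for all x, under the initialization (35) with
# v0 = x0++ (assumed here to be the long step x0 - grad f(x0)/mu).
rng = np.random.default_rng(2)
A = np.diag([1.5, 6.0, 9.0])
mu = 1.5                                 # smallest eigenvalue of A
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x0 = np.array([2.0, -1.0, 0.5])
g0 = grad(x0)
v0 = x0 - g0 / mu                        # x0++ (assumed long-step form)
phi0_star = f(x0) - (g0 @ g0) / (2 * mu)  # minimal value, as in (35)
phi0 = lambda x: phi0_star + 0.5 * mu * np.dot(x - v0, x - v0)

# Expanding the square shows phi_0 is the strong-convexity lower bound at x0,
# so phi_0(x) <= f(x) everywhere; spot-check at random points.
for x in rng.standard_normal((100, 3)):
    assert phi0(x) <= f(x) + 1e-12
assert phi0_star <= f(x0)                # unlike an ES, where f(x0) <= phi0*
```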

Finally, note that Definition 1 also holds for composite functions. Definition 2 applies only to smooth functions, although Nesterov has extended the ES framework to the composite setting; see Section 4 in [24] and Section 4 in [25]. Moreover, the relationship between the OQA method and an ES is discussed in "Appendix 1" of [8].

Appendix 2: Comparison of a UES and the study [7]

In [7], the authors proposed a general scheme for the analysis of first-order methods. For the strongly convex cases (both smooth and nonsmooth), there are some similarities between the methods in [7] and our proposed methods, but we stress that the approaches in this work are inherently different from those in [7]. More specifically, Table 3 summarizes the iterates of ASUESA and compares them with the iterates (for the smooth and strongly convex setting) in [7]. The similarities and differences between the two studies are as follows:

  1. 1.

    The \(\beta _k\)s are different for each method, and if the problem is ill-conditioned (large \(\kappa \)), then the \(\beta _k\)s are close to each other (\(\approx 1-\dfrac{1}{\sqrt{\kappa }}\)).

  2. 2.

    The \(y_k\)s have the same update structure, but because the \(\beta _k\)s are different, this results in different updates \(y_k\) (later we will see that the \(v_k\)s and \(x_k\)s are also different as the algorithms progress).

  3. 3.

    The \(x_{k+1}\)s are identical in structure. However, again the iterates \(x_{k+1}\) are different because the \(\beta _k\)s, and subsequently \(y_k\)s, are different for both methods.

  4. 4.

    The \(v_{k+1}\)s are different both in structure and clearly in their values.

  5. 5.

    The \(x_k\)s in our proposed study form part of an underestimate sequence, which guarantees the natural stopping criterion, i.e., that \(f(x_k) - \phi _k^*\) goes to zero at a linear rate.

Table 3 Comparison of the iterates


About this article


Cite this article

Jahani, M., Gudapati, N.V.C., Ma, C. et al. Fast and safe: accelerated gradient methods with optimality certificates and underestimate sequences. Comput Optim Appl 79, 369–404 (2021). https://doi.org/10.1007/s10589-021-00269-4


Keywords

  • Underestimate sequence
  • Estimate sequence
  • Quadratic averaging
  • Lower bounds
  • Strongly convex
  • Smooth minimization
  • Composite minimization
  • Accelerated algorithms

Mathematics Subject Classification

  • 90C25
  • 90C47
  • 68Q25