Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition

  • Hamed Karimi
  • Julie Nutini
  • Mark Schmidt
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9851)

Abstract

In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the Łojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years. We also use the PL inequality to give new analyses of coordinate descent and stochastic gradient for many non-strongly-convex (and some non-convex) functions. We further propose a generalization that applies to proximal-gradient methods for non-smooth optimization, leading to simple proofs of linear convergence for support vector machines and L1-regularized least squares without additional assumptions.
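For readers unfamiliar with the condition, the PL inequality referred to in the abstract is usually stated as follows (a standard formulation; the notation \(\mu\) for the PL constant and \(f^*\) for the optimal value is ours, not fixed by this page). A function \(f\) with \(L\)-Lipschitz gradient satisfies the PL inequality if

\[ \tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr) \quad \text{for all } x, \]

and under this condition gradient descent with step size \(1/L\) attains the linear rate

\[ f(x_k) - f^* \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^*\bigr). \]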

Keywords

Gradient descent · Coordinate descent · Stochastic gradient · Variance reduction · Boosting · Support vector machines · L1-regularization

Acknowledgments

We would like to thank Simon Lacoste-Julien and Martin Takáč for valuable discussions. This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC RGPIN-06068-2015). Julie Nutini is funded by a UBC Four Year Doctoral Fellowship (4YF) and Hamed Karimi is supported by a Mathematics of Information Technology and Complex Systems (MITACS) Elevate Fellowship.

Supplementary material

Supplementary material 1 (PDF, 230 KB): 431503_1_En_50_MOESM1_ESM.pdf

References

  1. Agarwal, A., Negahban, S.N., Wainwright, M.J.: Fast global convergence rates of gradient methods for high-dimensional statistical recovery. Ann. Statist. 40, 2452–2482 (2012)
  2. Anitescu, M.: Degenerate nonlinear programming with a quadratic growth condition. SIAM J. Optim. 10, 1116–1135 (2000)
  3. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. Ser. B 116, 5–16 (2009)
  4. Bach, F., Moulines, E.: Non-strongly-convex smooth stochastic approximation with convergence rate \(O(1/n)\). In: Advances in Neural Information Processing Systems (NIPS), pp. 773–791 (2013)
  5. Ben-Israel, A., Mond, B.: What is invexity? J. Austral. Math. Soc. 28, 1–9 (1986)
  6. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From Error Bounds to the Complexity of First-Order Descent Methods for Convex Functions. arXiv:1510.08234 (2015)
  7. Craven, B.D., Glover, B.M.: Invex functions and duality. J. Austral. Math. Soc. 39, 1–20 (1985)
  8. Dinuzzo, F., Ong, C.S., Gehler, P., Pillonetto, G.: Learning output kernels with block coordinate descent. In: Proceedings of the 28th ICML, pp. 49–56 (2011)
  9. Garber, D., Hazan, E.: Faster rates for the Frank-Wolfe method over strongly-convex sets. In: Proceedings of the 32nd ICML, pp. 541–549 (2015)
  10. Garber, D., Hazan, E.: Faster and Simple PCA via Convex Optimization. arXiv:1509.05647v4 (2015)
  11. Gu, M., Lim, L.-H., Wu, C.J.: ParNes: a rapidly convergent algorithm for accurate recovery of sparse and approximately sparse signals. Numer. Algor. 64, 321–347 (2013)
  12. Hanson, M.A.: On sufficiency of the Kuhn-Tucker conditions. J. Math. Anal. Appl. 80, 545–550 (1981)
  13. Hoffman, A.J.: On approximate solutions of systems of linear inequalities. J. Res. Nat. Bur. Stand. 49, 263–265 (1952)
  14. Hou, K., Zhou, Z., So, A.M.-C., Luo, Z.-Q.: On the linear convergence of the proximal gradient method for trace norm regularization. In: Advances in Neural Information Processing Systems (NIPS), pp. 710–718 (2013)
  15. Hush, D., Kelly, P., Scovel, C., Steinwart, I.: QP algorithms with guaranteed accuracy and run time for support vector machines. J. Mach. Learn. Res. 7, 733–769 (2006)
  16. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp. 315–323 (2013)
  17. Kadkhodaie, M., Sanjabi, M., Luo, Z.-Q.: On the Linear Convergence of the Approximate Proximal Splitting Method for Non-Smooth Convex Optimization. arXiv:1404.5350v1 (2014)
  18. Li, G., Pong, T.K.: Calculus of the Exponent of Kurdyka-Łojasiewicz Inequality and its Applications to Linear Convergence of First-Order Methods. arXiv:1602.02915v1 (2016)
  19. Liu, J., Wright, S.J.: Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM J. Optim. 25, 351–376 (2015)
  20. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. arXiv:1311.1873v3 (2014)
  21. Łojasiewicz, S.: A Topological Property of Real Analytic Subsets (in French). Coll. du CNRS, Les équations aux dérivées partielles, vol. 117, pp. 87–89 (1963)
  22. Luo, Z.-Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46, 157–178 (1993)
  23. Ma, C., Tappenden, R., Takáč, M.: Linear Convergence of the Randomized Feasible Descent Method Under the Weak Strong Convexity Assumption. arXiv:1506.02530 (2015)
  24. Meir, R., Rätsch, G.: An introduction to boosting and leveraging. In: Mendelson, S., Smola, A.J. (eds.) Advanced Lectures on Machine Learning. LNCS (LNAI), vol. 2600, pp. 118–183. Springer, Heidelberg (2003). doi:10.1007/3-540-36434-X_4
  25. Necoara, I., Nesterov, Y., Glineur, F.: Linear Convergence of First Order Methods for Non-Strongly Convex Optimization. arXiv:1504.06298v3 (2015)
  26. Necoara, I., Clipici, D.: Parallel random coordinate descent method for composite minimization: convergence analysis and error bounds. SIAM J. Optim. 26, 197–226 (2016)
  27. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)
  28. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Dordrecht (2004)
  29. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22, 341–362 (2012)
  30. Nutini, J., Schmidt, M., Laradji, I.H., Friedlander, M., Koepke, H.: Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In: Proceedings of the 32nd ICML, pp. 1632–1641 (2015)
  31. Polyak, B.T.: Gradient methods for minimizing functionals (in Russian). Zh. Vychisl. Mat. Mat. Fiz. 3, 643–653 (1963)
  32. Riedmiller, M., Braun, H.: RPROP - a fast adaptive learning algorithm. In: Proceedings of ISCIS VII (1992)
  33. Roux, N.L., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems (NIPS), pp. 2672–2680 (2012)
  34. Schmidt, M., Roux, N.L., Bach, F.R.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems (NIPS), pp. 1458–1466 (2011)
  35. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
  36. Reddi, S.J., Sra, S., Poczos, B., Smola, A.: Fast Incremental Method for Nonconvex Optimization. arXiv:1603.06159 (2016)
  37. Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.: Stochastic Variance Reduction for Nonconvex Optimization. arXiv:1603.06160 (2016)
  38. Reddi, S.J., Sra, S., Poczos, B., Smola, A.: Fast Stochastic Methods for Nonsmooth Nonconvex Optimization. arXiv:1605.06900 (2016)
  39. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. Ser. A 144, 1–38 (2014)
  40. Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. Ser. B 125, 263–295 (2010)
  41. Tseng, P., Yun, S.: Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. J. Optim. Theory Appl. 140, 513–535 (2009)
  42. Wang, P.-W., Lin, C.-J.: Iteration complexity of feasible descent methods for convex optimization. J. Mach. Learn. Res. 15, 1523–1548 (2014)
  43. Xiao, L., Zhang, T.: A proximal-gradient homotopy method for the sparse least-squares problem. SIAM J. Optim. 23, 1062–1091 (2013)
  44. Zhang, H.: The Restricted Strong Convexity Revisited: Analysis of Equivalence to Error Bound and Quadratic Growth. arXiv:1511.01635 (2015)
  45. Zhang, H., Yin, W.: Gradient Methods for Convex Minimization: Better Rates Under Weaker Conditions. arXiv:1303.4645v2 (2013)

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Department of Computer Science, University of British Columbia, Vancouver, Canada