Stochastic Subgradient Method Converges on Tame Functions

  • Damek Davis
  • Dmitriy Drusvyatskiy (email author)
  • Sham Kakade
  • Jason D. Lee
Article

Abstract

This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science—including all popular deep learning architectures.
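As an illustration only, the iteration discussed in the abstract can be sketched on a toy problem. The example below is not taken from the paper: the test function f(x) = |x^2 - 1| (semialgebraic, locally Lipschitz, nonsmooth and nonconvex), the Gaussian noise model, the step sizes 0.1/(k+1)^0.75 (not summable, but square summable), and the names `subgradient` and `stochastic_subgradient` are all assumptions chosen for the sketch.

```python
import random


def subgradient(x):
    """Return one Clarke subgradient of the toy function f(x) = |x**2 - 1|."""
    r = x * x - 1.0
    if r > 0:
        return 2.0 * x
    if r < 0:
        return -2.0 * x
    # At the kinks x = +/-1 the Clarke subdifferential is [-2, 2]; pick 0.
    return 0.0


def stochastic_subgradient(x0, steps, seed=0, noise_std=0.5):
    """Run x_{k+1} = x_k - alpha_k * (g_k + noise_k) and return the last iterate.

    Assumed setup (hypothetical, for illustration): zero-mean Gaussian noise
    and step sizes alpha_k = 0.1 / (k + 1)**0.75.
    """
    rng = random.Random(seed)
    x = x0
    for k in range(steps):
        alpha = 0.1 / (k + 1) ** 0.75
        g = subgradient(x) + rng.gauss(0.0, noise_std)
        x -= alpha * g
    return x


if __name__ == "__main__":
    x_final = stochastic_subgradient(x0=2.0, steps=20000)
    print("final iterate:", x_final)
    print("objective value:", abs(x_final * x_final - 1.0))
```

On this toy function the first-order stationary points are x = -1, 0 and 1, so a run started at x0 = 2.0 would be expected to settle near x = 1. A single run is only a sanity check of the iteration, not evidence for the convergence theorem.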

Keywords

Subgradient; Proximal; Stochastic subgradient method; Differential inclusion; Lyapunov function; Semialgebraic; Tame

Mathematics Subject Classification

65K05 65K10 34A60 90C15 

Copyright information

© SFoCM 2018

Authors and Affiliations

  • Damek Davis (1)
  • Dmitriy Drusvyatskiy (2), email author
  • Sham Kakade (3)
  • Jason D. Lee (4)

  1. School of Operations Research and Information Engineering, Cornell University, Ithaca, USA
  2. Department of Mathematics, University of Washington, Seattle, USA
  3. Departments of Statistics and Computer Science, University of Washington, Seattle, USA
  4. Data Science and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, USA
