Stochastic Subgradient Method Converges on Tame Functions

Abstract

This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science—including all popular deep learning architectures.

Notes

  1. The zero mean assumption on \(\xi _k\) is not for free when f is given in expectation form \(f(x) = {\mathbb {E}}\left[ f(x, \omega ) \right] \) and we choose \(y_k + \xi _k \in \partial f(x_k, \omega _k)\) with \(\omega _k \sim P\). It is true under certain circumstances [11, Theorem 2.7.2], [9], but verifying its validity in general remains an open and difficult question. In deterministic settings, a principled automatic differentiation approach for computing Clarke subgradients has been proposed in [24,25,26]. A minimal illustration of such an oracle appears after these notes.

  2. Concurrent to this work, the independent preprint [29] also provides convergence guarantees for the stochastic projected subgradient method, under the assumption that the objective function is “subdifferentially regular” and the constraint set is convex. Subdifferential regularity rules out functions with downward kinks and cusps, such as deep networks with ReLU(\(\cdot \)) activation functions. Besides subsuming the subdifferentially regular case, the results of the current paper apply to the broad class of Whitney stratifiable functions, which includes all popular deep network architectures.

  3. The term “tame” used in the title has a technical meaning. Tame sets are those whose intersection with any ball is definable in some o-minimal structure. The manuscript [22] provides a nice exposition on the role of tame sets and functions in optimization.

  4. In the assumption, replace \(x_k\) with \(w_k\), since we now use \(w_k\) to denote the stochastic subgradient iterates.
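
To make the oracle model in Note 1 concrete, here is a minimal sketch of the stochastic subgradient method with an oracle in the expectation form \(f(x) = {\mathbb {E}}\left[ f(x, \omega ) \right] \), using a hinge-type loss over an empirical distribution as a stand-in. The data model and the helper names are our own illustration, not constructions from the paper; since each \(f(\cdot , \omega )\) below is convex, the zero mean property of \(\xi _k\) does hold here, whereas Note 1 cautions that it can fail in general.

```python
import numpy as np

def subgrad_sample(x, a, y):
    """One Clarke subgradient of f(., omega) = max(0, 1 - y*<a, x>) at x."""
    if 1.0 - y * a.dot(x) > 0.0:
        return -y * a
    return np.zeros_like(x)  # at the kink, this selects one valid element

def stochastic_subgradient_method(x0, data, alphas, rng):
    """Iterate x_{k+1} = x_k - alpha_k*(y_k + xi_k), where y_k + xi_k is a
    subgradient of f(., omega_k) at x_k for an independently drawn omega_k."""
    x = x0.copy()
    for alpha_k in alphas:
        a, y = data[rng.integers(len(data))]  # draw omega_k ~ P (empirical)
        x = x - alpha_k * subgrad_sample(x, a, y)
    return x
```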

References

  1. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

  2. M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. SIAM J. Control Optim., 44(1):328–348, 2005.

  3. M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. II. Applications. Math. Oper. Res., 31(4):673–695, 2006.

  4. J. Bolte, A. Daniilidis, A.S. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.

  5. V.S. Borkar. Stochastic approximation. Cambridge University Press, Cambridge; Hindustan Book Agency, New Delhi, 2008. A dynamical systems viewpoint.

  6. J.M. Borwein and X. Wang. Lipschitz functions with maximal Clarke subdifferentials are generic. Proc. Amer. Math. Soc., 128(11):3221–3229, 2000.

  7. H. Brézis. Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert. North-Holland Math. Stud. 5, North-Holland, Amsterdam, 1973.

  8. R.E. Bruck, Jr. Asymptotic convergence of nonlinear contraction semigroups in Hilbert space. J. Funct. Anal., 18:15–26, 1975.

  9. J.V. Burke, X. Chen, and H. Sun. Subdifferentiation and smoothing of nonsmooth integral functionals. Preprint, Optimization-Online, May 2017.

  10. F.H. Clarke. Generalized gradients and applications. Trans. Amer. Math. Soc., 205:247–262, 1975.

  11. F.H. Clarke. Optimization and nonsmooth analysis, volume 5 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, second edition, 1990.

  12. F.H. Clarke, Y.S. Ledyaev, R.J. Stern, and P.R. Wolenski. Nonsmooth analysis and control theory, volume 178. Springer Science & Business Media, 2008.

  13. M. Coste. An introduction to o-minimal geometry. RAAG Notes, 81 pages, Institut de Recherche Mathématiques de Rennes, November 1999.

  14. M. Coste. An Introduction to Semialgebraic Geometry. RAAG Notes, 78 pages, Institut de Recherche Mathématiques de Rennes, October 2002.

  15. D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. To appear in SIAM J. Optim., arXiv:1803.06523, 2018.

  16. D. Davis and D. Drusvyatskiy. Stochastic subgradient method converges at the rate \({O}(k^{-1/4})\) on weakly convex functions. arXiv:1802.02988, 2018.

  17. A. Dembo. Probability Theory: Stat310/Math230 lecture notes, Stanford University, September 3, 2016. Available at http://statweb.stanford.edu/~adembo/stat-310b/lnotes.pdf.

  18. D. Drusvyatskiy, A.D. Ioffe, and A.S. Lewis. Curves of descent. SIAM J. Control Optim., 53(1):114–138, 2015.

  19. J.C. Duchi and F. Ruan. Stochastic methods for composite optimization problems. Preprint arXiv:1703.08570, 2017.

  20. S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.

  21. A.D. Ioffe. Critical values of set-valued maps with stratifiable graphs. Extensions of Sard and Smale-Sard theorems. Proc. Amer. Math. Soc., 136(9):3111–3119, 2008.

  22. A.D. Ioffe. An invitation to tame optimization. SIAM J. Optim., 19(4):1894–1917, 2008.

  23. A.D. Ioffe. Variational analysis of regular mappings. Springer Monographs in Mathematics. Springer, Cham, 2017. Theory and applications.

  24. S. Kakade and J.D. Lee. Provably correct automatic subdifferentiation for qualified programs. arXiv preprint arXiv:1809.08530, 2018.

  25. K.A. Khan and P.I. Barton. Evaluating an element of the Clarke generalized Jacobian of a composite piecewise differentiable function. ACM Trans. Math. Software, 39(4):Art. 23, 28, 2013.

  26. K.A. Khan and P.I. Barton. A vector forward mode of automatic differentiation for generalized derivative evaluation. Optimization Methods and Software, 30(6):1185–1212, 2015.

  27. H.J. Kushner and G.G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability.

  28. S. Łojasiewicz. Ensembles semi-analytiques. IHES Lecture Notes, 1965.

  29. S. Majewski, B. Miasojedow, and E. Moulines. Analysis of nonsmooth stochastic approximation: the differential inclusion approach. Preprint arXiv:1805.01916, 2018.

  30. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2009.

  31. A.S. Nemirovsky and D.B. Yudin. Problem complexity and method efficiency in optimization. A Wiley-Interscience Publication. John Wiley & Sons, Inc., New York, 1983.

  32. E.A. Nurminskii. Minimization of nondifferentiable functions in the presence of noise. Cybernetics, 10(4):619–621, 1974.

  33. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

  34. H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407, 1951.

  35. R.T. Rockafellar. The theory of subgradients and its applications to problems of optimization, volume 1 of R & E. Heldermann Verlag, Berlin, 1981.

  36. R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften, Vol 317, Springer, Berlin, 1998.

  37. G.V. Smirnov. Introduction to the theory of differential inclusions, volume 41 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2002.

  38. T. Tao. An introduction to measure theory, volume 126. American Mathematical Soc., 2011.

  39. L. van den Dries and C. Miller. Geometric categories and o-minimal structures. Duke Math. J., 84:497–540, 1996.

  40. H. Whitney. A function not constant on a connected set of critical points. Duke Math. J., 1(4):514–517, 1935.

  41. A.J. Wilkie. Model completeness results for expansions of the ordered field of real numbers by restricted Pfaffian functions and the exponential function. J. Amer. Math. Soc., 9(4):1051–1094, 1996.

Author information

Corresponding author

Correspondence to Dmitriy Drusvyatskiy.

Additional information

Communicated by Michael Overton.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Research of Dmitriy Drusvyatskiy was supported by the AFOSR YIP Award FA9550-15-1-0237 and by the NSF DMS 1651851 and CCF 1740551 Awards. Sham Kakade acknowledges funding from the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery and the NSF CCF 1740551 Award. Jason D. Lee acknowledges funding from the ARO MURI Award W911NF-11-1-0303.

A Proofs for the Proximal Extension

In this section, we follow the notation of Sect. 6. Namely, we let \(\zeta :{\mathbb {R}}^d \times \Omega \rightarrow {\mathbb {R}}^d\) be the stochastic subgradient oracle and \(T_{(\cdot )}(\cdot ) : (0, \infty ) \times {\mathbb {R}}^d \rightarrow {\mathbb {R}}^d\) the proximal selection. Throughout, we let \(x_k\) and \(\omega _k\) be generated by the proximal stochastic subgradient method (6.4) and suppose that Assumption E holds. Let \({\mathcal {F}}_k :=\sigma (x_j,\omega _{j-1}: j\le k)\) be the sigma algebra generated by the history of the algorithm.
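
For orientation, the update (6.4) takes the form \(x_{k+1} = T_{\alpha _k}(x_k - \alpha _k \zeta (x_k, \omega _k))\). The following minimal sketch specializes to \(g = \Vert \cdot \Vert _1\) and \({\mathcal {X}}= {\mathbb {R}}^d\), so that the proximal selection reduces to soft-thresholding; this choice, and the placeholder Gaussian draw of \(\omega _k\), are illustrative assumptions only and not part of the general setting.

```python
import numpy as np

def prox_l1(z, t):
    """Proximal selection T_t for g = ||.||_1 and X = R^d: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_stochastic_subgradient(x0, zeta, alphas, rng):
    """Iterate x_{k+1} = T_{alpha_k}(x_k - alpha_k * zeta(x_k, omega_k))."""
    x = x0.copy()
    for alpha_k in alphas:
        omega_k = rng.standard_normal(x.shape)  # placeholder draw omega_k ~ P
        x = prox_l1(x - alpha_k * zeta(x, omega_k), alpha_k)
    return x
```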

Let us now formally define the normal cone constructions of variational analysis. For any point \(x\in {\mathcal {X}}\), the proximal normal cone to \({\mathcal {X}}\) at x is the set

$$\begin{aligned} N^P_{{\mathcal {X}}}(x):=\{\lambda v\in {\mathbb {R}}^d: x\in \mathrm{proj}_{{\mathcal {X}}}(x+v),\lambda \ge 0\}, \end{aligned}$$

where \(\mathrm{proj}_{{\mathcal {X}}}(\cdot )\) denotes the nearest point map to \({\mathcal {X}}\). The limiting normal cone to \({\mathcal {X}}\) at x, denoted \(N^L_{\mathcal {X}}(x)\), consists of all vectors \(v\in {\mathbb {R}}^d\) such that there exist sequences \(x_i\in {{\mathcal {X}}}\) and \(v_i\in N^P_{{\mathcal {X}}}(x_i)\) satisfying \((x_i,v_i)\rightarrow (x,v)\). The Clarke normal cone to \({\mathcal {X}}\) at x is then simply

$$\begin{aligned} N_{{\mathcal {X}}}(x):=\mathrm {cl}\,\text {conv}\,N^L_{\mathcal {X}}(x). \end{aligned}$$
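
To see how the three cones may differ, consider the nonconvex union of the coordinate axes \({\mathcal {X}}=\{x \in {\mathbb {R}}^2 : x_1 x_2 = 0\}\). At the origin, no nonzero vector v satisfies \(0 \in \mathrm{proj}_{{\mathcal {X}}}(0 + v)\), while proximal normals at nearby points of each axis accumulate onto the other axis. Hence

$$\begin{aligned} N^P_{{\mathcal {X}}}(0)=\{0\},\qquad N^L_{{\mathcal {X}}}(0)=(\{0\}\times {\mathbb {R}})\cup ({\mathbb {R}}\times \{0\}),\qquad N_{{\mathcal {X}}}(0)={\mathbb {R}}^2. \end{aligned}$$

For convex sets, by contrast, all three constructions coincide with the normal cone of convex analysis.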

A.1 Auxiliary Lemmas

In this subsection, we record a few auxiliary lemmas to be used in the sequel.

Lemma A.1

There exists a function \(L :{\mathbb {R}}^d \rightarrow {\mathbb {R}}_+\), which is bounded on bounded sets, such that for any \(x, v \in {\mathbb {R}}^d\), and \(\alpha > 0\), we have

$$\begin{aligned} \alpha ^{-1} \Vert x - x_+\Vert \le 2\cdot L(x)+ 2\cdot \Vert v\Vert , \end{aligned}$$

where we set \(x_+ := T_{\alpha }(x - \alpha v)\).

Proof

Let \(L(\cdot )\) be the function from Property E.2. From the definition of the proximal map, we deduce

$$\begin{aligned}&\tfrac{1}{2\alpha } \Vert x_+ - x\Vert ^2\le g(x) - g(x_+) -\langle v, x_{+} - x\rangle \le L(x)\cdot \Vert x_+ - x\Vert + \Vert v\Vert \cdot \Vert x_{+} - x\Vert , \end{aligned}$$

where the second inequality follows by Property E.2 if \(g(x_+) \le g(x)\); otherwise if \(g(x_+) \ge g(x)\), then \(g(x) - g(x_+) \le 0 \le L(x) \Vert x_+ - x\Vert \) trivially. Dividing both sides by \(\tfrac{1}{2}\Vert x_+ - x\Vert \) yields the result. \(\square \)
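
As a numerical sanity check of Lemma A.1, consider the simplest instance with \(g = \Vert \cdot \Vert _1\) and no constraint, for which the constant function \(L(x) \equiv \sqrt{d}\) satisfies Property E.2 and \(T_\alpha \) is soft-thresholding. The following script (an illustration under these assumptions only) verifies the bound on random instances.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
L = np.sqrt(d)  # g = ||.||_1 is sqrt(d)-Lipschitz with respect to ||.||_2

def prox_l1(z, t):  # T_t(z) for g = ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

for _ in range(10_000):
    x = rng.normal(size=d)
    v = rng.normal(size=d) * rng.exponential()
    alpha = rng.exponential()
    x_plus = prox_l1(x - alpha * v, alpha)  # x_+ = T_alpha(x - alpha*v)
    lhs = np.linalg.norm(x - x_plus) / alpha
    assert lhs <= 2 * L + 2 * np.linalg.norm(v) + 1e-8  # Lemma A.1 bound
```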

Lemma A.2

Let \(\{z_k\}_{k\ge 1}\) be a bounded sequence in \({\mathbb {R}}^d\) and let \(\{\beta _k\}_{k\ge 1}\) be a nonnegative sequence satisfying \( \sum _{k=1}^\infty \beta _k^2 < \infty . \) Then almost surely over \(\omega \sim P\), we have \(\beta _k \zeta (z_k, \omega )\rightarrow 0\).

Proof

Notice that because \(\{z_k\}_{k\ge 1}\) is bounded, it follows that \(\{p(z_k)\}\) is bounded. Now consider the random variable \(X_k = \beta _k^2 \Vert \zeta (z_k, \cdot )\Vert ^2\). Due to the estimate

$$\begin{aligned} \sum _{k=1}^\infty {\mathbb {E}}\left[ X_k\right] \le \sum _{k=1}^\infty \beta _k^2p(z_k) <\infty , \end{aligned}$$

standard results in measure theory (e.g., [38, Exercise 1.5.5]) imply that \(X_k \rightarrow 0\) almost surely. Since \(X_k = \beta _k^2 \Vert \zeta (z_k, \cdot )\Vert ^2\), it follows that \(\beta _k \zeta (z_k, \omega )\rightarrow 0\) almost surely, as claimed. \(\square \)

Lemma A.3

Almost surely, we have \(\alpha _k\Vert \zeta (x_k, \omega _k)\Vert \rightarrow 0\) as \(k \rightarrow \infty \).

Proof

From the variance bound, \({\mathbb {E}}\left[ \Vert X - {\mathbb {E}}\left[ X\right] \Vert ^2\right] \le {\mathbb {E}}\left[ \Vert X\Vert ^2\right] \), and Assumption E, we have

$$\begin{aligned} {\mathbb {E}}\left[ \Vert \zeta (x_k, \omega _{k}) - {\mathbb {E}}\left[ \zeta (x_k, \omega _{k}) \mid {\mathcal {F}}_k \right] \Vert ^2\mid {\mathcal {F}}_k \right] \le {\mathbb {E}}\left[ \Vert \zeta (x_k, \omega _{k})\Vert ^2\mid {\mathcal {F}}_k \right] \le p(x_k). \end{aligned}$$

Therefore, the following infinite sum is a.s. finite:

$$\begin{aligned} \sum _{i=1}^\infty \alpha _i^2 {\mathbb {E}}\left[ \Vert \zeta (x_i, \omega _{i}) - {\mathbb {E}}\left[ \zeta (x_i, \omega _{i})\mid {\mathcal {F}}_i \right] \Vert ^2\mid {\mathcal {F}}_i \right] \le \sum _{i=1}^\infty \alpha _i^2 p(x_i) < \infty . \end{aligned}$$

Define the \(L^2\) martingale \(X_k = \sum _{i=1}^k \alpha _i (\zeta (x_i, \omega _i) - {\mathbb {E}}\left[ \zeta (x_i, \omega _i) \mid {\mathcal {F}}_i\right] )\). Thus, the limit \(\langle X\rangle _{\infty }\) of the predictable compensator

$$\begin{aligned} \langle X\rangle _k := \sum _{i=1}^k \alpha _i^2 {\mathbb {E}}\left[ \Vert \zeta (x_i, \omega _{i}) - {\mathbb {E}}\left[ \zeta (x_i, \omega _{i})\mid {\mathcal {F}}_i \right] \Vert ^2\mid {\mathcal {F}}_i \right] , \end{aligned}$$

exists. Applying [17, Theorem 5.3.33(a)], we deduce that almost surely \(X_k\) converges to a finite limit, which directly implies \(\alpha _k\Vert \zeta (x_k, \omega _{k}) - {\mathbb {E}}\left[ \zeta (x_k, \omega _{k})\mid {\mathcal {F}}_k \right] \Vert \rightarrow 0\) almost surely as \(k \rightarrow \infty \). Therefore, since \(\alpha _k \Vert {\mathbb {E}}\left[ \zeta (x_k, \omega _{k})\mid {\mathcal {F}}_k \right] \Vert \le \alpha _k{\mathbb {E}}\left[ \Vert \zeta (x_k, \omega _{k})\Vert \mid {\mathcal {F}}_k \right] \le \alpha _k \sqrt{p(x_k)} \rightarrow 0\) almost surely as \(k \rightarrow \infty \), it follows that \(\alpha _k\Vert \zeta (x_k, \omega _{k})\Vert \rightarrow 0\) almost surely as \(k \rightarrow \infty \). \(\square \)
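
The conclusion of Lemma A.3 is easy to visualize numerically. The toy experiment below (our illustration, with a bounded second moment stand-in for \(\Vert \zeta (x_k, \omega _k)\Vert \)) shows the tail suprema of \(\alpha _k \Vert \zeta \Vert \) collapsing to zero for the square-summable choice \(\alpha _k = 1/k\).

```python
import numpy as np

rng = np.random.default_rng(1)
K = 10**6
alpha = 1.0 / np.arange(1, K + 1)     # square-summable step sizes
g = np.abs(rng.standard_normal(K))    # stand-in for ||zeta(x_k, omega_k)||
seq = alpha * g
tail_sup = np.maximum.accumulate(seq[::-1])[::-1]  # sup_{j>=k} alpha_j*g_j
for k in (1, 10, 100, 10_000, 100_000):
    print(k, tail_sup[k - 1])         # decreases toward zero
```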

A.2 Proof of Theorem 6.2

In addition to Assumption E, let us now suppose that Assumption F holds. Define the set-valued map \(G:Q\rightrightarrows {\mathbb {R}}^d\) by \(G=-\partial f-\partial g-N_{{\mathcal {X}}}\). We aim to apply Theorem 3.2, which would immediately imply the validity of Theorem 6.2. To this end, notice that Assumption F is exactly Assumption B for our map G. Thus we must only verify that Assumption A holds almost surely. Properties A.1 and A.3 hold vacuously, so it remains to show that A.2, A.4, and A.5 hold. The argument we present is essentially the same as in [19, Section 3.2.2].

For each index k, define the set-valued map

$$\begin{aligned} G_{k}(x) := -\partial f(x) - \alpha _k^{-1} \cdot {\mathbb {E}}_{\omega }\left[ x - \alpha _k \zeta (x, \omega ) - T_{\alpha _k}(x - \alpha _k\zeta (x, \omega ))\right] \end{aligned}$$

Note that \(G_k\) is a deterministic map, with k only signifying the dependence on the deterministic sequence \(\alpha _k\). Define now the noise sequence

$$\begin{aligned} \xi _{k} := \tfrac{1}{\alpha _k}\left[ T_{\alpha _k}(x_k - \alpha _k\zeta (x_k, \omega _k)) -x_k\right] -\tfrac{1}{\alpha _k}\left[ {\mathbb {E}}_{\omega }\left[ T_{\alpha _k}(x_k - \alpha _k \zeta (x_k, \omega ))-x_k\right] \right] . \end{aligned}$$

Let us now write the proximal stochastic subgradient method in the form (3.2).

Lemma A.4

(Recursion relation) For all \(k \ge 0\), we have

$$\begin{aligned} x_{k+1} = x_k + \alpha _k\left[ y_k + \xi _k\right] \quad \text {for some }y_k \in G_k(x_k). \end{aligned}$$

Proof

Notice that for every index \(k\ge 0\), we have

$$\begin{aligned} \tfrac{1}{\alpha _k} (x_k - x_{k+1})&= \tfrac{1}{\alpha _k} \left[ x_k - T_{\alpha _k }(x_k - \alpha _k\zeta (x_k, \omega _k))\right] \\&= {\mathbb {E}}_{\omega }\left[ \zeta (x_k, \omega )\right] +\tfrac{1}{\alpha _k} {\mathbb {E}}_{\omega }\left[ x_k - \alpha _k \zeta (x_k, \omega ) - T_{\alpha _k}(x_k - \alpha _k\zeta (x_k, \omega ))\right] \\& +\, \tfrac{1}{\alpha _k}\left[ {\mathbb {E}}_{\omega }\left[ T_{\alpha _k}(x_k - \alpha _k \zeta (x_k, \omega ))\right] -T_{\alpha _k}(x_k - \alpha _k\zeta (x_k, \omega _k))\right] \\&\in -G_k(x_k) - \xi _k, \end{aligned}$$

as desired. \(\square \)
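
In algorithmic terms, Lemma A.4 says that one proximal step splits exactly into the drift \(y_k = \alpha _k^{-1}({\mathbb {E}}_{\omega }[T_{\alpha _k}(x_k - \alpha _k \zeta (x_k, \omega ))] - x_k)\), which lies in \(G_k(x_k)\) once the oracle mean \({\mathbb {E}}_{\omega }[\zeta (x_k,\omega )]\) is taken as the element of \(\partial f(x_k)\), and the conditionally zero-mean term \(\xi _k\). A sketch of this bookkeeping follows, with \({\mathbb {E}}_{\omega }\) replaced by a Monte Carlo average over a batch of samples (an approximation used only for illustration).

```python
import numpy as np

def split_step(x, alpha, zeta, prox, omega_k, omega_batch):
    """Split x_{k+1} = T_alpha(x - alpha*zeta(x, omega_k)) into
    x + alpha*(y + xi), following Lemma A.4.  E_omega[.] is estimated by
    averaging over omega_batch (an illustration-only surrogate)."""
    T = lambda w: prox(x - alpha * zeta(x, w), alpha)
    ET = np.mean([T(w) for w in omega_batch], axis=0)  # ~ E_omega[T_alpha(...)]
    y = (ET - x) / alpha              # drift: an element of G_k(x)
    xi = (T(omega_k) - ET) / alpha    # conditionally zero-mean noise
    return y, xi  # x + alpha*(y + xi) == T(omega_k) == x_{k+1}
```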

The following lemma shows that A.4 holds almost surely.

Lemma A.5

(Weighted noise sequence) The limit \(\displaystyle \lim _{n \rightarrow \infty } \sum _{i=1}^n \alpha _i \xi _i\) exists almost surely.

Proof

We first prove that \(\{\alpha _k\xi _k\}\) is an \(L^2\) martingale difference sequence, meaning that for all k, we have

$$\begin{aligned} {\mathbb {E}}\left[ \alpha _k \xi _k \mid {\mathcal {F}}_k\right] = 0 \quad \text {and} \quad \sum _{k=1}^\infty \alpha _k^2 {\mathbb {E}}\left[ \Vert \xi _k\Vert ^2 \mid {\mathcal {F}}_k \right] < \infty . \end{aligned}$$

Clearly, \(\xi _k\) has zero mean conditioned on the past, and so we need only focus on the second property. By the variance bound, \({\mathbb {E}}\left[ \Vert X - {\mathbb {E}}\left[ X\right] \Vert ^2\right] \le {\mathbb {E}}\left[ \Vert X\Vert ^2\right] \), and Lemma A.1, we have

$$\begin{aligned} {\mathbb {E}}\left[ \Vert \xi _k\Vert ^2 \mid {\mathcal {F}}_k \right]&\le \frac{1}{\alpha _k^2}{\mathbb {E}}\left[ \left\| T_{\alpha _k}(x_k - \alpha _k\zeta (x_k, \omega _k)) - x_k\right\| ^2 \mid {\mathcal {F}}_k\right] \\&\le 4\cdot L(x_k)^2+ 4\cdot {\mathbb {E}}\left[ \Vert \zeta (x_k, \omega _k)\Vert ^2 \mid {\mathcal {F}}_k\right] . \end{aligned}$$

Notice that because \(\{x_k\}\) is bounded a.s., it follows that \(\{L(x_k)\}\) and \(\{p(x_k)\}\) are bounded a.s. Therefore, because

$$\begin{aligned} \sum _{k=1}^\infty \alpha _k^2{\mathbb {E}}\left[ \Vert \zeta (x_k, \omega _k)\Vert ^2 \mid {\mathcal {F}}_k\right] \le \sum _{k=1}^\infty \alpha _k^2p(x_k) <\infty , \end{aligned}$$

it follows that \( \sum _{k=1}^\infty \alpha _k^2{\mathbb {E}}\left[ \Vert \xi _k\Vert ^2 \mid {\mathcal {F}}_k\right] < \infty , \) almost surely, as desired.

Now, define the \(L^2\) martingale \(X_k = \sum _{i=1}^k \alpha _i \xi _{i}\). Thus, the limit \(\langle X\rangle _{\infty }\) of the predictable compensator

$$\begin{aligned} \langle X\rangle _k := \sum _{i=1}^k \alpha _i^2 {\mathbb {E}}\left[ \Vert \xi _i\Vert ^2\mid {\mathcal {F}}_i \right] , \end{aligned}$$

exists. Applying [17, Theorem 5.3.33(a)], we deduce that almost surely \(X_k\) converges to a finite limit, which completes the proof of the claim. \(\square \)

Now we turn our attention to A.2.

Lemma A.6

Almost surely, the sequence \(\{y_k\}\) is bounded.

Proof

Because the sequence \(\{x_k\}\) is almost surely bounded and f is locally Lipschitz, clearly we have

$$\begin{aligned} \sup \left\{ \Vert v\Vert : v\in \bigcup _{k \ge 1} \partial f(x_k)\right\} < \infty , \end{aligned}$$

almost surely. Thus, we need only show that

$$\begin{aligned} \sup _{k \ge 1} \left\{ \left\| \frac{1}{\alpha _k} {\mathbb {E}}_{\omega }\left[ x_k - \alpha _k \zeta (x_k, \omega ) - T_{\alpha _k}(x_k - \alpha _k\zeta (x_k, \omega ))\right] \right\| \right\} < \infty , \end{aligned}$$

almost surely. To this end, by the triangle inequality and Lemma A.1, we have for any fixed \(\omega \in \Omega \) the bound

$$\begin{aligned} \left\| \tfrac{1}{\alpha _k}\left[ x_k - T_{\alpha _k}(x_k - \alpha _k\zeta (x_k, \omega ))\right] \right\| \le 2\cdot L(x_k) + 2\cdot \Vert \zeta (x_k, \omega )\Vert \end{aligned}$$

Therefore, by Jensen’s inequality, we have that

$$\begin{aligned}&\left\| \tfrac{1}{\alpha _k} {\mathbb {E}}_{\omega }\left[ x_k - \alpha _k \zeta (x_k, \omega ) - T_{\alpha _k}(x_k - \alpha _k\zeta (x_k, \omega ))\right] \right\| \\&\le 2\cdot L(x_k) + 3\cdot {\mathbb {E}}_{\omega } \left[ \Vert \zeta (x_k, \omega )\Vert \right] \\&\le 2\cdot L(x_k) + 3\cdot \sqrt{p(x_k)}, \end{aligned}$$

which is almost surely bounded for all k. Taking the supremum yields the result. \(\square \)

As the last step, we verify Item A.5.

Lemma A.7

Item A.5 of Assumption A holds.

Proof

Assumption A.5 requires us to bound all subsequential averages of the direction vectors \(\{y_k\}\). It is convenient to prove a more general statement: for any sequence of points \(z_k\) converging to a point z and any choice \(y_{n_k} \in G_{n_k}(z_k)\), we have \( \mathrm{dist}\left( \frac{1}{n}\sum _{k=1}^n y_{n_k}, G(z)\right) \rightarrow 0 \). Assumption A.5 is then an immediate consequence.

Thus, consider any sequence \(\{z_k\} \subseteq {\mathcal {X}}\) converging to a point \(z\in {\mathcal {X}}\) and an arbitrary sequence \(w_k^f \in \partial f(z_k)\). Let \(\{n_k\}\) be an unbounded increasing sequence of indices. Since G(z) is convex, Jensen's inequality yields

$$\begin{aligned}&\mathrm{dist}\left( \frac{1}{n}\sum _{k=1}^n\left( -w_k^f - \frac{1}{\alpha _{n_k}} {\mathbb {E}}_{\omega } \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] \right) , G(z)\right) \\ \quad&\le \frac{1}{n}\sum _{k=1}^n{\mathbb {E}}_{\omega }\left[ \mathrm{dist}\left( -w_k^f - \frac{1}{\alpha _{n_k}} \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] , G(z)\right) \right] . \end{aligned}$$

Our goal is to prove that the right-hand side tends to zero almost surely, which directly implies the validity of A.5.

We aim to apply the dominated convergence theorem to conclude that each term in the above sum tends to zero. To that end, we must verify two properties: for every fixed \(\omega \), each term tends to zero as \(k \rightarrow \infty \); and each term is dominated by an integrable function of \(\omega \). We now prove both properties.

Claim 3

Almost surely in \(\omega \sim P\), we have that

$$\begin{aligned} \mathrm{dist}\left( -w_k^f - \frac{1}{\alpha _{n_k}} \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] , G(z)\right) \rightarrow 0 \quad \text {as }~{k \rightarrow \infty }. \end{aligned}$$

Proof of Claim 3

Optimality conditions [36, Exercise 10.10] of the proximal subproblem imply

$$\begin{aligned} \tfrac{1}{\alpha _{n_k}} \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] = w^g_k(\omega ) + w^{\mathcal {X}}_k(\omega ), \end{aligned}$$

for some \(w^g_k(\omega )\in \partial g(T_{\alpha _{n_k}}(z_k - \alpha _{n_k} \zeta (z_k, \omega )))\) and \(w^{\mathcal {X}}_k(\omega )\in N_{{\mathcal {X}}}^{L}(T_{\alpha _{n_k}}(z_k - \alpha _{n_k} \zeta (z_k, \omega )))\), and where \(N_{{\mathcal {X}}}^{L}\) denotes the limiting normal cone. Observe that by continuity and the fact that \(\sum _{k=1}^\infty \alpha _{n_k}^2 < \infty \) and \(\alpha _{n_k} \zeta (z_k, \omega ) \rightarrow 0\) as \(k \rightarrow \infty \) a.e. (see Lemma A.2), it follows that

$$\begin{aligned} T_{\alpha _{n_k}}(z_k - \alpha _{n_k} \zeta (z_k, \omega )) \rightarrow z. \end{aligned}$$

Indeed, setting \(z_k^+ = T_{\alpha _{n_k}}(z_k - \alpha _{n_k} \zeta (z_k, \omega ))\), we have that by Lemma A.1,

$$\begin{aligned} \Vert z_k - z_{k}^+\Vert \le 2\alpha _{n_k} L(z_k) + 2\alpha _{n_k} \Vert \zeta (z_k, \omega )\Vert \rightarrow 0 \quad \text {as}~ {k \rightarrow \infty }, \end{aligned}$$

which implies that \(\lim _{k \rightarrow \infty } z_k^+ = \lim _{k \rightarrow \infty } z_k = z\).

We furthermore deduce that \(w^{\mathcal {X}}_k(\omega )\) and \(w_k^g(\omega )\) are bounded almost surely. Indeed, \(w_k^g(\omega )\) is bounded since g is locally Lipschitz and \(z_k^+\) are bounded. Moreover, Lemma A.1 implies

$$\begin{aligned} \Vert w_k^g(\omega ) + w_k^{\mathcal {X}}(\omega )\Vert = \left\| \tfrac{1}{\alpha _{n_k}} \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - z_k^+\right] \right\| \le 2\cdot L(z_k) + 3\cdot \sup _{k \ge 1} \Vert \zeta (z_k,\omega )\Vert . \end{aligned}$$

Observe that the right-hand side is a.s. bounded by item 6 of Assumption E. Thus, since \(w_k^g(\omega ) + w_k^{\mathcal {X}}(\omega )\) and \(w_k^g(\omega )\) are a.s. bounded, it follows that \(w_k^{\mathcal {X}}(\omega )\) must also be a.s. bounded, as desired.

Appealing to outer semicontinuity of \(\partial f, \partial g,\) and \(N^{L}_{{\mathcal {X}}}\) (e.g., [36, Proposition 6.6]), the inclusion \(N^{L}_{{\mathcal {X}}}\subset N_{{\mathcal {X}}}\), and the boundedness of \(\{w^f_k\}, \{w^g_k(\omega )\},\) and \(\{w^{\mathcal {X}}_k(\omega )\}\), it follows that

$$\begin{aligned} \mathrm{dist}( w^f_k, \partial f(z)) \rightarrow 0; \quad \mathrm{dist}( w^g_k(\omega ), \partial g(z)) \rightarrow 0;\quad \mathrm{dist}(w^{\mathcal {X}}_k(\omega ), N_{{\mathcal {X}}}(z)) \rightarrow 0, \end{aligned}$$

as \( k \rightarrow \infty \). Consequently, almost surely we have that

$$\begin{aligned}&\mathrm{dist}\left( -w_k^f - \tfrac{1}{\alpha _{n_k}} \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] , G(z)\right) \\ \quad&\le \mathrm{dist}( w^f_k, \partial f(z)) + \mathrm{dist}( w^g_k(\omega ), \partial g(z)) + \mathrm{dist}(w^{\mathcal {X}}_k(\omega ), N_{{\mathcal {X}}}(z)) \rightarrow 0, \end{aligned}$$

as desired. \(\square \)

Claim 4

Let \(L_f := \sup _{k \ge 1}\mathrm{dist}(0, \partial f(z_k))\) and \( L_g := \sup _{k \ge 1}L(z_{k})\). Then for all \(k\ge 0\), the functions

$$\begin{aligned}&\mathrm{dist}\left( -w_k^f - \tfrac{1}{\alpha _{n_k}} \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] , G(z)\right) \end{aligned}$$

are uniformly dominated by an integrable function in \(\omega \).

Proof of Claim 4

For each k, Lemma A.1 implies the bound

$$\begin{aligned} \left\| \tfrac{1}{\alpha _{n_k}} \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] \right\| \le 2L_g + 3\cdot \Vert \zeta (z_k, \omega )\Vert . \end{aligned}$$

Consequently, we have

$$\begin{aligned}&\mathrm{dist}\left( -w_k^f - \tfrac{1}{\alpha _{n_k}} \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] , G(z)\right) \\ \quad&\le L_f + 2L_g + 3\cdot \Vert \zeta (z_k, \omega )\Vert + \mathrm{dist}(0, G(z))\\ \quad&\le L_f + 2L_g + 3\cdot \sup _{k \ge 1} \Vert \zeta (z_k,\omega )\Vert +\mathrm{dist}(0, \partial f(z) + \partial g(z)), \end{aligned}$$

which is integrable by Item 6 of Assumption E. \(\square \)

Applying the dominated convergence theorem, it follows that

$$\begin{aligned} {\mathbb {E}}_{\omega }\left[ \mathrm{dist}\left( -w_k^f - \tfrac{1}{\alpha _{n_k}} \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] , G(z)\right) \right] \rightarrow 0 \end{aligned}$$

as \(k \rightarrow \infty \). Notice the simple fact that for any real sequence \(b_k \rightarrow 0 \), it must be that \(\frac{1}{n} \sum _{k=1}^{n}b_k \rightarrow 0\) as \(n \rightarrow \infty \). Consequently

$$\begin{aligned}&\mathrm{dist}\left( \frac{1}{n}\sum _{k=1}^n\left( -w_k^f - \tfrac{1}{\alpha _{n_k}} {\mathbb {E}}_{\omega } \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] \right) , G(z)\right) \\ \quad&\le \frac{1}{n}\sum _{k=1}^n{\mathbb {E}}_{\omega }\left[ \mathrm{dist}\left( -w_k^f - \tfrac{1}{\alpha _{n_k}} \left[ z_k - \alpha _{n_k} \zeta (z_k, \omega ) - T_{\alpha _{n_k}}(z_k - \alpha _{n_k}\zeta (z_k, \omega ))\right] , G(z)\right) \right] \rightarrow 0 \end{aligned}$$

as \(n \rightarrow \infty \). This completes the proof. \(\square \)

We have now verified all the assumptions of Theorem 3.2, and therefore the proof of Theorem 6.2 is complete.

A.3 Verifying Assumption F for Composite Problems

Proof of Lemma 6.3

The argument is nearly identical to that of Lemma 5.2, with the additional subtlety that G is not necessarily outer semicontinuous. Let \(z:{\mathbb {R}}_+\rightarrow {\mathcal {X}}\) be an arc. Since f, g, and \({\mathcal {X}}\) admit a chain rule, we deduce

$$\begin{aligned} (f\circ z)'(t)=\langle \partial f(z(t)),{\dot{z}}(t)\rangle ,\quad (g\circ z)'(t)=\langle \partial g(z(t)),{\dot{z}}(t)\rangle ,\quad \text {and}\quad 0=\langle N_{{\mathcal {X}}}(z(t)),\dot{z}(t)\rangle , \end{aligned}$$

for a.e. \(t\ge 0\). Adding the three equations yields

$$\begin{aligned} (\varphi \circ z)'(t)=-\langle G(z(t)),{\dot{z}}(t)\rangle \quad \text { for a.e. }t\ge 0. \end{aligned}$$

Suppose now that \(z(\cdot )\) satisfies \({\dot{z}}(t)\in G(z(t))\) for a.e. \(t\ge 0\). Then the same linear algebraic argument as in Lemma 5.2 yields the equality \(\Vert {\dot{z}}(t)\Vert = \mathrm{dist}(0;G(z(t)))\) for a.e. \(t\ge 0\), and consequently Eq. (6.6).

To complete the proof, we must show that property 2 of Assumption F holds. To this end, suppose that z(0) is not composite critical and let \(T>0\) be arbitrary. Appealing to (6.6), clearly \(\sup _{t\in [0,T]} \varphi (z(t))\le \varphi (z(0))\). Thus we need only argue \(\varphi (z(T))<\varphi (z(0))\). According to (6.6), if this were not the case, then we would deduce \(\mathrm{dist}(0;G(z(t)))=0\) for a.e. \(t\in [0,T]\). Appealing to the equality \(\Vert \dot{z}\Vert =\mathrm{dist}(0;G(z(t)))\), we therefore conclude \(\Vert {\dot{z}}\Vert =0\) for a.e. \(t\in [0,T]\). Since \(z(\cdot )\) is absolutely continuous, it must be constant, \(z(\cdot )\equiv z(0)\); but this is a contradiction since \(0\notin G(z(0))\). Thus property 2 of Assumption F holds, as claimed. \(\square \)
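
A one-dimensional example illustrates the resulting descent identity (6.6). Take \(f(x) = |x|\), \(g \equiv 0\), and \({\mathcal {X}}={\mathbb {R}}\), so that \(\varphi = f\) and \(G = -\partial f\). The arc \(z(t) = \max (1-t, 0)\) satisfies \({\dot{z}}(t) \in G(z(t))\) for a.e. \(t\ge 0\), and

$$\begin{aligned} (\varphi \circ z)'(t)=-\mathrm{dist}^2(0;G(z(t)))= {\left\{ \begin{array}{ll} -1, &{} 0<t<1,\\ 0, &{} t>1. \end{array}\right. } \end{aligned}$$

Thus \(\varphi \) strictly decreases until the arc reaches the composite critical point \(z = 0\), where \(0 \in \partial f(0) = [-1,1]\), exactly as property 2 of Assumption F requires.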

Proof of Corollary 6.4

The result follows immediately from Theorem 6.2, once we show that Assumption F holds. Since f and g are definable in an o-minimal structure, Theorem 5.8 implies that f and g admit the chain rule. The same argument as in Theorem 5.8 moreover implies that \({\mathcal {X}}\) admits the chain rule as well. Therefore, Lemma 6.3 guarantees that the descent property of Assumption F holds. Thus we must only argue the weak Sard property of Assumption F. To this end, since f, g, and \({\mathcal {X}}\) are definable in an o-minimal structure, there exist Whitney \(C^{d}\)-stratifications \({\mathcal {A}}_f\), \({\mathcal {A}}_g\), and \({\mathcal {A}}_{{\mathcal {X}}}\) of \(\mathrm{gph}\,f\), \(\mathrm{gph}\,g\), and \({\mathcal {X}}\), respectively. Let \(\Pi {\mathcal {A}}_f\) and \(\Pi {\mathcal {A}}_g\) be the Whitney stratifications of \({\mathbb {R}}^d\) obtained by applying the coordinate projection \((x,r)\mapsto x\) to each stratum in \({\mathcal {A}}_f\) and \({\mathcal {A}}_g\). Appealing to [39, Theorem 4.8], we obtain a Whitney \(C^d\)-stratification \({\mathcal {A}}\) of \({\mathbb {R}}^d\) that is compatible with \((\Pi {\mathcal {A}}_f,\Pi {\mathcal {A}}_g, {\mathcal {A}}_{\mathcal {X}})\). That is, for all strata \(M\in {\mathcal {A}}\) and \(L\in \Pi {\mathcal {A}}_f\cup \Pi {\mathcal {A}}_g\cup {\mathcal {A}}_{\mathcal {X}}\), either \(M\cap L=\emptyset \) or \(M\subseteq L\).

Consider an arbitrary stratum \(M\in {\mathcal {A}}\) intersecting \({\mathcal {X}}\) (and therefore contained in \({\mathcal {X}}\)) and a point \(x\in M\). Consider now the (unique) strata \(M_f\in \Pi {\mathcal {A}}_{f}\), \(M_g\in \Pi {\mathcal {A}}_{g}\), and \(M_{{\mathcal {X}}}\in {\mathcal {A}}_{\mathcal {X}}\) containing x. Let \({\widehat{f}}\) and \({\widehat{g}}\) be \(C^d\)-smooth functions agreeing with f and g on a neighborhood of x in \( M_f\) and \( M_g\), respectively. Appealing to (5.5), we conclude

$$\begin{aligned} \partial f(x)\subset \nabla {{\widehat{f}}}(x)+N_{ M_f}(x) \quad \text {and}\quad \partial g(x)\subset \nabla {{\widehat{g}}}(x)+N_{M_g}(x). \end{aligned}$$

The Whitney condition in turn directly implies \(N_{{\mathcal {X}}}(x)\subset N_{M_{{\mathcal {X}}}}(x).\) Hence summing yields

$$\begin{aligned} \partial f(x)+\partial g(x)+N_{{\mathcal {X}}}(x)&\subset \nabla ({{\widehat{f}}}+{{\widehat{g}}})(x)+N_{ M_f}(x)+N_{ M_g}(x)+N_{M_{{\mathcal {X}}}}(x)\\&\subset \nabla ({{\widehat{f}}}+{{\widehat{g}}})(x)+N_{M}(x), \end{aligned}$$

where the last inclusion follows from the compatibility relations \(M\subset M_f\), \(M\subset M_g\), and \(M\subset M_{{\mathcal {X}}}\). Notice that \({{\widehat{f}}}+{{\widehat{g}}}\) agrees with \(f+g\) on a neighborhood of x in M. Hence, if the inclusion \(0\in \partial f(x)+\partial g(x)+N_{{\mathcal {X}}}(x)\) holds, then x is a critical point, in the classical sense, of the \(C^d\)-smooth function \(f+g\) restricted to M. Applying the standard Sard theorem to each stratum M, the result follows. \(\square \)
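
For a concrete instance of this argument, take \(f(x) = |x|\), \(g \equiv 0\), and \({\mathcal {X}}={\mathbb {R}}\). The partition \({\mathcal {A}}=\{(-\infty ,0), \{0\}, (0,\infty )\}\) is a Whitney stratification compatible with the projected stratification of \(\mathrm{gph}\,f\), and on each stratum f agrees with a smooth function (\(-x\), the constant 0, or x). The only point with \(0 \in \partial f(x) + N_{{\mathcal {X}}}(x)\) is x = 0, which is trivially a critical point of f restricted to the zero-dimensional stratum \(\{0\}\). The set of composite critical values is therefore the single value 0, a measure-zero set, as the weak Sard property requires.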

Cite this article

Davis, D., Drusvyatskiy, D., Kakade, S. et al. Stochastic Subgradient Method Converges on Tame Functions. Found Comput Math 20, 119–154 (2020). https://doi.org/10.1007/s10208-018-09409-5
