Abstract
This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science—including all popular deep learning architectures.

Notes
The zero mean assumption on \(\xi _k\) does not come for free when f is given in expectation form \(f(x) = {\mathbb {E}}\left[ f(x, \omega ) \right] \) and we choose \(y_k + \xi _k \in \partial f(x_k, \omega _k)\) with \(\omega _k \sim P\). It does hold under certain circumstances [11, Theorem 2.7.2], [9], but verifying its validity in general remains an open and difficult question. In deterministic settings, a principled automatic differentiation approach for computing Clarke subgradients has been proposed in [24,25,26].
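To see why care is needed at points of nondifferentiability, the following short PyTorch [33] snippet (our own illustration, not drawn from the references) evaluates \(f(x) = \mathrm{relu}(x) - \mathrm{relu}(-x)\), a function identically equal to x, so that \(\partial f(0)=\{1\}\); standard automatic differentiation nevertheless returns 0 at the kink, which is not a Clarke subgradient.

import torch

x = torch.tensor(0.0, requires_grad=True)
# f(x) = relu(x) - relu(-x) equals x for every real x, so its only (Clarke) subgradient at 0 is 1.
y = torch.relu(x) - torch.relu(-x)
y.backward()
print(x.grad)   # prints tensor(0.), since autodiff uses the convention relu'(0) = 0 in both branches

Discrepancies of this kind are precisely what the approaches of [24,25,26] are designed to address.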
Concurrent to this work, the independent preprint [29] also provides convergence guarantees for the stochastic projected subgradient method, under the assumption that the objective function is “subdifferentially regular” and the constraint set is convex. Subdifferential regularity rules out functions with downward kinks and cusps, such as deep networks with the ReLU(\(\cdot \)) activation function. Besides subsuming the subdifferentially regular case, the results of the current paper apply to the broad class of Whitney stratifiable functions, which includes all popular deep network architectures.
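For instance, the semialgebraic function \(f(x)=-|x|\) has a downward kink at the origin and is not subdifferentially regular there: the usual directional derivative is \(f'(0;d)=-|d|\), whereas the Clarke directional derivative is \(f^{\circ }(0;d)=|d|\). Functions of this kind are nevertheless covered by the results of the present paper.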
The term “tame” used in the title has a technical meaning. Tame sets are those whose intersection with any ball is definable in some o-minimal structure. The manuscript [22] provides a nice exposition on the role of tame sets and functions in optimization.
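For example, the graph of \(x\mapsto \sin (x)\) on \({\mathbb {R}}\) is tame: its intersection with any ball is definable in the o-minimal structure of restricted analytic functions, even though the full graph is not semialgebraic. In contrast, the graph of \(x\mapsto \sin (1/x)\) on \((0,1]\) is not tame, since its intersection with a small ball around the origin has infinitely many connected components, while every definable set has only finitely many.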
In the assumption, replace \(x_k\) with \(w_k\), since we now use \(w_k\) to denote the stochastic subgradient iterates.
References
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. SIAM J. Control Optim., 44(1):328–348, 2005.
M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. II. Applications. Math. Oper. Res., 31(4):673–695, 2006.
J. Bolte, A. Daniilidis, A.S. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.
V.S. Borkar. Stochastic approximation. Cambridge University Press, Cambridge; Hindustan Book Agency, New Delhi, 2008. A dynamical systems viewpoint.
J.M. Borwein and X. Wang. Lipschitz functions with maximal Clarke subdifferentials are generic. Proc. Amer. Math. Soc., 128(11):3221–3229, 2000.
H. Brézis. Opérateurs maximaux monotones et semi-groupes de contraction dans des espaces de Hilbert. North-Holland Math. Stud. 5, North-Holland, Amsterdam, 1973.
R.E. Bruck, Jr. Asymptotic convergence of nonlinear contraction semigroups in Hilbert space. J. Funct. Anal., 18:15–26, 1975.
J.V. Burke, X. Chen, and H. Sun. Subdifferentiation and smoothing of nonsmooth integral functionals. Preprint, Optimization-Online, May 2017.
F.H. Clarke. Generalized gradients and applications. Trans. Amer. Math. Soc., 205:247–262, 1975.
F.H. Clarke. Optimization and nonsmooth analysis, volume 5 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, second edition, 1990.
F.H. Clarke, Y.S. Ledyaev, R.J. Stern, and P.R. Wolenski. Nonsmooth analysis and control theory, volume 178. Springer Science & Business Media, 2008.
M. Coste. An introduction to o-minimal geometry. RAAG Notes, 81 pages, Institut de Recherche Mathématiques de Rennes, November 1999.
M. Coste. An Introduction to Semialgebraic Geometry. RAAG Notes, 78 pages, Institut de Recherche Mathématiques de Rennes, October 2002.
D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. To appear in SIAM J. Optim., arXiv:1803.06523, 2018.
D. Davis and D. Drusvyatskiy. Stochastic subgradient method converges at the rate \({O}(k^{-1/4})\) on weakly convex functions. arXiv:1802.02988, 2018.
A. Dembo. Probability theory: Stat310/Math230 lecture notes, Stanford University, September 3, 2016. Available at http://statweb.stanford.edu/~adembo/stat-310b/lnotes.pdf.
D. Drusvyatskiy, A.D. Ioffe, and A.S. Lewis. Curves of descent. SIAM J. Control Optim., 53(1):114–138, 2015.
J.C. Duchi and F. Ruan. Stochastic methods for composite optimization problems. Preprint arXiv:1703.08570, 2017.
S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.
A.D. Ioffe. Critical values of set-valued maps with stratifiable graphs. Extensions of Sard and Smale-Sard theorems. Proc. Amer. Math. Soc., 136(9):3111–3119, 2008.
A.D. Ioffe. An invitation to tame optimization. SIAM J. Optim., 19(4):1894–1917, 2008.
A.D. Ioffe. Variational analysis of regular mappings. Springer Monographs in Mathematics. Springer, Cham, 2017. Theory and applications.
S. Kakade and J.D. Lee. Provably correct automatic subdifferentiation for qualified programs. arXiv preprint arXiv:1809.08530, 2018.
K.A. Khan and P.I. Barton. Evaluating an element of the Clarke generalized Jacobian of a composite piecewise differentiable function. ACM Trans. Math. Software, 39(4):Art. 23, 28, 2013.
K.A. Khan and P.I. Barton. A vector forward mode of automatic differentiation for generalized derivative evaluation. Optimization Methods and Software, 30(6):1185–1212, 2015.
H.J. Kushner and G.G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability.
S. Łojasiewicz. Ensembles semi-analytiques. IHES Lecture Notes, 1965.
S. Majewski, B. Miasojedow, and E. Moulines. Analysis of nonsmooth stochastic approximation: the differential inclusion approach. Preprint arXiv:1805.01916, 2018.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2008.
A.S. Nemirovsky and D.B. Yudin. Problem complexity and method efficiency in optimization. A Wiley-Interscience Publication. John Wiley & Sons, Inc., New York, 1983.
E.A. Nurminskii. Minimization of nondifferentiable functions in the presence of noise. Cybernetics, 10(4):619–621, Jul 1974.
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407, 1951.
R.T. Rockafellar. The theory of subgradients and its applications to problems of optimization, volume 1 of R & E. Heldermann Verlag, Berlin, 1981.
R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften, Vol 317, Springer, Berlin, 1998.
G.V. Smirnov. Introduction to the theory of differential inclusions, volume 41 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2002.
T. Tao. An introduction to measure theory, volume 126. American Mathematical Soc., 2011.
L. van den Dries and C. Miller. Geometric categories and o-minimal structures. Duke Math. J., 84:497–540, 1996.
H. Whitney. A function not constant on a connected set of critical points. Duke Math. J., 1(4):514–517, 12 1935.
A.J. Wilkie. Model completeness results for expansions of the ordered field of real numbers by restricted Pfaffian functions and the exponential function. J. Amer. Math. Soc., 9(4):1051–1094, 1996.
Additional information
Communicated by Michael Overton.
Research of Dmitriy Drusvyatskiy was supported by the AFOSR YIP Award FA9550-15-1-0237 and by the NSF DMS 1651851 and CCF 1740551 Awards. Sham Kakade acknowledges funding from the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery and the NSF CCF 1740551 Award. Jason D. Lee acknowledges funding from the ARO MURI Award W911NF-11-1-0303.
A Proofs for the Proximal Extension
In this section, we follow the notation of Sect. 6. Namely, we let \(\zeta :{\mathbb {R}}^d \times \Omega \rightarrow {\mathbb {R}}^d\) be the stochastic subgradient oracle and \(T_{(\cdot )}(\cdot ) : (0, \infty ) \times {\mathbb {R}}^d \rightarrow {\mathbb {R}}^d\) the proximal selection. Throughout, we let \(x_k\) and \(\omega _k\) be generated by the proximal stochastic subgradient method (6.4) and suppose that Assumption E holds. Let \({\mathcal {F}}_k :=\sigma (x_j,\omega _{j-1}: j\le k)\) be the sigma algebra generated by the history of the algorithm.
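For intuition, the following self-contained NumPy sketch (our own illustration, with hypothetical problem data) instantiates the method on the composite problem \(\min _x {\mathbb {E}}_{a,b}|\langle a,x\rangle -b| + \lambda \Vert x\Vert _1\) with \({\mathcal {X}}={\mathbb {R}}^d\), assuming the update takes the form \(x_{k+1}=T_{\alpha _k}(x_k-\alpha _k \zeta (x_k,\omega _k))\) as in (6.4); here \(T_{\alpha }\) reduces to soft-thresholding, the proximal map of \(g=\lambda \Vert \cdot \Vert _1\), and the step sizes satisfy \(\sum _k \alpha _k=\infty \) and \(\sum _k \alpha _k^2<\infty \).

import numpy as np

rng = np.random.default_rng(0)
d, lam = 10, 0.1
x_star = rng.standard_normal(d)            # hypothetical signal generating the data

def zeta(x):
    # One-sample stochastic subgradient of f(x) = E_{a,b} |<a, x> - b|.
    a = rng.standard_normal(d)
    b = a @ x_star
    return np.sign(a @ x - b) * a

def prox_l1(x, alpha):
    # T_alpha(x): proximal map of g = lam * ||.||_1 (soft-thresholding); X = R^d, so no projection is needed.
    return np.sign(x) * np.maximum(np.abs(x) - alpha * lam, 0.0)

x = np.zeros(d)
for k in range(1, 20001):
    alpha = 0.5 / k ** 0.75                # sum_k alpha_k = infinity, sum_k alpha_k^2 < infinity
    x = prox_l1(x - alpha * zeta(x), alpha)

print(np.linalg.norm(x - x_star))          # small, but not zero, because of the l1 penalty

The example is convex only for simplicity; the analysis in this appendix applies to nonsmooth, nonconvex choices of f and g as well.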
Let us now formally define the normal cone constructions of variational analysis. For any point \(x\in {\mathcal {X}}\), the proximal normal cone to \({\mathcal {X}}\) at x is the set
\[ N^P_{{\mathcal {X}}}(x) := \left\{ v\in {\mathbb {R}}^d : x \in \mathrm{proj}_{{\mathcal {X}}}(x + r v) \text{ for some } r > 0 \right\} , \]
where \(\mathrm{proj}_{{\mathcal {X}}}(\cdot )\) denotes the nearest point map to \({\mathcal {X}}\). The limiting normal cone to \({\mathcal {X}}\) at x, denoted \(N^L_{\mathcal {X}}(x)\), consists of all vectors \(v\in {\mathbb {R}}^d\) for which there exist sequences \(x_i\in {{\mathcal {X}}}\) and \(v_i\in N^P_{{\mathcal {X}}}(x_i)\) satisfying \((x_i,v_i)\rightarrow (x,v)\). The Clarke normal cone to \({\mathcal {X}}\) at x is then simply the closed convex hull
\[ N_{{\mathcal {X}}}(x) := \mathrm{cl}\, \mathrm{conv}\, N^L_{{\mathcal {X}}}(x). \]
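As a quick illustration of how these cones may differ, consider the nonconvex semialgebraic set \({\mathcal {X}}=\{(t,|t|): t\in {\mathbb {R}}\}\subset {\mathbb {R}}^2\). A direct computation shows that at the origin
\[ N^P_{{\mathcal {X}}}(0)=\{v: v_2\le -|v_1|\},\qquad N^L_{{\mathcal {X}}}(0)=N^P_{{\mathcal {X}}}(0)\cup {\mathbb {R}}\cdot (1,1)\cup {\mathbb {R}}\cdot (1,-1),\qquad N_{{\mathcal {X}}}(0)={\mathbb {R}}^2, \]
whereas all three cones coincide at every other point of \({\mathcal {X}}\), and at every point of any convex set.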
A.1 Auxiliary Lemmas
In this subsection, we record a few auxiliary lemmas to be used in the sequel.
Lemma A.1
There exists a function \(L :{\mathbb {R}}^d \rightarrow {\mathbb {R}}_+\), which is bounded on bounded sets, such that for any \(x, v \in {\mathbb {R}}^d\), and \(\alpha > 0\), we have
where we set \(x_+ := T_{\alpha }(x - \alpha v)\).
Proof
Let \(L(\cdot )\) be the function from Property E.2. From the definition of the proximal map, we deduce
where the second inequality follows by Property E.2 if \(g(x_+) \le g(x)\); otherwise if \(g(x_+) \ge g(x)\), then \(g(x) - g(x_+) \le 0 \le L(x) \Vert x_+ - x\Vert \) trivially. Dividing both sides by \(\tfrac{1}{2}\Vert x_+ - x\Vert \) yields the result. \(\square \)
Lemma A.2
Let \(\{z_k\}_{k\ge 1}\) be a bounded sequence in \({\mathbb {R}}^d\) and let \(\{\beta _k\}_{k\ge 1}\) be a nonnegative sequence satisfying \( \sum _{k=1}^\infty \beta _k^2 < \infty . \) Then almost surely over \(\omega \sim P\), we have \(\beta _k \zeta (z_k, \omega )\rightarrow 0\).
Proof
Notice that because \(\{z_k\}_{k\ge 1}\) is bounded, it follows that \(\{p(z_k)\}\) is bounded. Now consider the random variable \(X_k = \beta _k^2 \Vert \zeta (z_k, \cdot )\Vert ^2\). Due to the estimate
standard results in measure theory (e.g., [38, Exercise 1.5.5]) imply that \(X_k \rightarrow 0\) almost surely. \(\square \)
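In more detail, assuming (as in Assumption E) that \(p(\cdot )\) bounds the second moment of the oracle, monotone convergence gives \({\mathbb {E}}\left[ \sum _{k}X_k\right] =\sum _k\beta _k^2\,{\mathbb {E}}\Vert \zeta (z_k,\cdot )\Vert ^2\le \left( \sup _j p(z_j)\right) \sum _k\beta _k^2<\infty \); hence \(\sum _k X_k<\infty \) almost surely, and in particular \(X_k\rightarrow 0\) almost surely, which is exactly the claim.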
Lemma A.3
Almost surely, we have \(\alpha _k\Vert \zeta (x_k, \omega _k)\Vert \rightarrow 0\) as \(k \rightarrow \infty \).
Proof
From the variance bound, \({\mathbb {E}}\left[ \Vert X - {\mathbb {E}}\left[ X\right] \Vert ^2\right] \le {\mathbb {E}}\left[ \Vert X\Vert ^2\right] \), and Assumption E, we have
Therefore, the following infinite sum is a.s. finite:
Define the \(L^2\) martingale \(X_k = \sum _{i=1}^k \alpha _i (\zeta (x_i, \omega _i) - {\mathbb {E}}\left[ \zeta (x_i, \omega _i) \mid {\mathcal {F}}_i\right] )\). Thus, the limit \(\langle X\rangle _{\infty }\) of the predictable compensator
exists. Applying [17, Theorem 5.3.33(a)], we deduce that almost surely \(X_k\) converges to a finite limit, which directly implies \(\alpha _k\Vert \zeta (x_k, \omega _{k}) - {\mathbb {E}}\left[ \zeta (x_k, \omega _{k})\mid {\mathcal {F}}_k \right] \Vert \rightarrow 0\) almost surely as \(k \rightarrow \infty \). Therefore, since \(\alpha _k \Vert {\mathbb {E}}\left[ \zeta (x_k, \omega _{k})\mid {\mathcal {F}}_k \right] \Vert \le \alpha _k{\mathbb {E}}\left[ \Vert \zeta (x_k, \omega _{k})\Vert \mid {\mathcal {F}}_k \right] \le \alpha _k \sqrt{p(x_k)} \rightarrow 0\) almost surely as \(k \rightarrow \infty \), it follows that \(\alpha _k\Vert \zeta (x_k, \omega _{k})\Vert \rightarrow 0\) almost surely as \(k \rightarrow \infty \). \(\square \)
A.2 Proof of Theorem 6.2
In addition to Assumption E, let us now suppose that Assumption F holds. Define the set-valued map \(G:Q\rightrightarrows {\mathbb {R}}^d\) by \(G=-\partial f-\partial g-N_{{\mathcal {X}}}\). We aim to apply Theorem 3.2, which would immediately imply the validity of Theorem 6.2. To this end, notice that Assumption F is exactly Assumption B for our map G. Thus we must only verify that Assumption A holds almost surely. Note that Properties A.1 and A.3 hold vacuously. Thus, we must only show that A.2, A.4, and A.5 hold. The argument we present is essentially the same as in [19, Section 3.2.2].
For each index k, define the set-valued map
Note that \(G_k\) is a deterministic map, with k only signifying the dependence on the deterministic sequence \(\alpha _k\). Define now the noise sequence
Let us now write the proximal stochastic subgradient method in the form (3.2).
Lemma A.4
(Recursion relation) For all \(k \ge 0\), we have
Proof
Notice that for every index \(k\ge 0\), we have
as desired. \(\square \)
The following lemma shows that A.4 holds almost surely.
Lemma A.5
(Weighted noise sequence) The limit \(\displaystyle \lim _{n \rightarrow \infty } \sum _{i=1}^n \alpha _i \xi _i\) exists almost surely.
Proof
We first prove that \(\{\alpha _k\xi _k\}\) is an \(L^2\) martingale difference sequence, meaning that for all k, we have
Clearly, \(\xi _k\) has zero mean conditioned on the past, and so we need only focus on the second property. By the variance bound, \({\mathbb {E}}\left[ \Vert X - {\mathbb {E}}\left[ X\right] \Vert ^2\right] \le {\mathbb {E}}\left[ \Vert X\Vert ^2\right] \), and Lemma A.1, we have
Notice that because \(\{x_k\}\) is bounded a.s., it follows that \(\{L(x_k)\}\) and \(\{p(x_k)\}\) are bounded a.s. Therefore, because
it follows that \( \sum _{k=1}^\infty \alpha _k^2{\mathbb {E}}\left[ \Vert \xi _k\Vert ^2 \mid {\mathcal {F}}_k\right] < \infty , \) almost surely, as desired.
Now, define the \(L^2\) martingale \(X_k = \sum _{i=1}^k \alpha _i \xi _{i}\). Thus, the limit \(\langle X\rangle _{\infty }\) of the predictable compensator
exists. Applying [17, Theorem 5.3.33(a)], we deduce that almost surely \(X_k\) converges to a finite limit, which completes the proof of the claim. \(\square \)
Now we turn our attention to A.2.
Lemma A.6
Almost surely, the sequence \(\{y_k\}\) is bounded.
Proof
Because the sequence \(\{x_k\}\) is almost surely bounded and f is locally Lipschitz, clearly we have
almost surely. Thus, we need only show that
almost surely. To this end, by the triangle inequality and Lemma A.1, we have for any fixed \(\omega \in \Omega \) the bound
Therefore, by Jensen’s inequality, we have that
which is almost surely bounded for all k. Taking the supremum yields the result. \(\square \)
As the last step, we verify Item A.5.
Lemma A.7
Item 5 of Assumption A is true.
Proof
Assumption A.5 requires us to bound all subsequential averages of the direction vectors \(\{y_k\}\). Here it is more convenient to prove a more general statement, namely, that for any sequence of points \(z_k\) converging to z and \(y_{n_k} \in G_{n_k}(z_k)\), we have \( \mathrm{dist}\left( \frac{1}{n}\sum _{k=1}^n y_{n_k}, G(z)\right) \rightarrow 0 \). Assumption A.5 is then an immediate consequence.
Thus, consider any sequence \(\{z_k\} \subseteq {\mathcal {X}}\) converging to a point \(z\in {\mathcal {X}}\) and an arbitrary sequence \(w_k^f \in \partial f(z_k)\). Let \(\{n_k\}\) be an unbounded increasing sequence of indices. Since \(G(z)\) is convex, Jensen’s inequality yields
Our goal is to prove that the right-hand side tends to zero almost surely, which directly implies the validity of A.5. To this end, we will apply the dominated convergence theorem to each term in the above finite sum and conclude that each term converges to zero. Thus we must show two properties: for every fixed \(\omega \), each term in the sum tends to zero, and each term is bounded by an integrable function. We now prove both properties.
Claim 3
Almost surely in \(\omega \sim P\), we have that
Proof of Claim 3
Optimality conditions [36, Exercise 10.10] of the proximal subproblem imply
for some \(w^g_k(\omega )\in \partial g(T_{\alpha _{n_k}}(z_k - \alpha _{n_k} \zeta (z_k, \omega )))\) and \(w^{\mathcal {X}}_k(\omega )\in N_{{\mathcal {X}}}^{L}(T_{\alpha _{n_k}}(z_k - \alpha _{n_k} \zeta (z_k, \omega )))\), and where \(N_{{\mathcal {X}}}^{L}\) denotes the limiting normal cone. Observe that by continuity and the fact that \(\sum _{k=1}^\infty \alpha _{n_k}^2 < \infty \) and \(\alpha _{n_k} \zeta (z_k, \omega ) \rightarrow 0\) as \(k \rightarrow \infty \) a.e. (see Lemma A.2), it follows that
Indeed, setting \(z_k^+ = T_{\alpha _{n_k}}(z_k - \alpha _{n_k} \zeta (z_k, \omega ))\), we have that by Lemma A.1,
which implies that \(\lim _{k \rightarrow \infty } z_k^+ = \lim _{k \rightarrow \infty } z_k = z\).
We furthermore deduce that \(w^{\mathcal {X}}_k(\omega )\) and \(w_k^g(\omega )\) are bounded almost surely. Indeed, \(w_k^g(\omega )\) is bounded since g is locally Lipschitz and \(z_k^+\) are bounded. Moreover, Lemma A.1 implies
Observe that the right-hand side is a.s. bounded by item 6 of Assumption E. Thus, since \(w_k^g(\omega ) + w_k^{\mathcal {X}}(\omega )\) and \(w_k^g(\omega )\) are a.s. bounded, it follows that \(w_k^{\mathcal {X}}(\omega )\) must also be a.s. bounded, as desired.
Appealing to outer semicontinuity of \(\partial f, \partial g,\) and \(N^{L}_{{\mathcal {X}}}\) (e.g., [36, Proposition 6.6]), the inclusion \(N^{L}_{{\mathcal {X}}}\subset N_{{\mathcal {X}}}\), and the boundedness of \(\{w^f_k\}, \{w^g_k(\omega )\},\) and \(\{w^{\mathcal {X}}_k(\omega )\}\), it follows that
as \( k \rightarrow \infty \). Consequently, almost surely we have that
as desired. \(\square \)
Claim 4
Let \(L_f := \sup _{k \ge 1}\mathrm{dist}(0, \partial f(z_k))\) and \( L_g := \sup _{k \ge 1}L(z_{k})\). Then for all \(k\ge 0\), the functions
are uniformly dominated by an integrable function in \(\omega \).
Proof of Claim 4
For each k, Lemma A.1 implies the bound
Consequently, we have
which is integrable by Item 6 of Assumption E. \(\square \)
Applying the dominated convergence theorem, it follows that
as \(k \rightarrow \infty \). Notice the simple fact that for any real sequence \(b_k \rightarrow 0 \), it must be that \(\frac{1}{n} \sum _{k=1}^{n}b_k \rightarrow 0\) as \(n \rightarrow \infty \). Consequently
as \(n \rightarrow \infty \). This completes the proof. \(\square \)
We have now verified all of the assumptions needed to apply Theorem 3.2; therefore, the proof is complete.
A.3 Verifying Assumption F for Composite Problems
Proof of Lemma 6.3
The argument is nearly identical to that of Lemma 5.2, with the one additional subtlety that G is not necessarily outer semicontinuous. Let \(z:{\mathbb {R}}_+\rightarrow {\mathcal {X}}\) be an arc. Since f, g, and \({\mathcal {X}}\) admit a chain rule, we deduce
for a.e. \(t\ge 0\). Adding the three equations yields
Suppose now that \(z(\cdot )\) satisfies \({\dot{z}}(t)\in -G(z(t))\) for a.e. \(t\ge 0\). Then the same linear algebraic argument as in Lemma 5.2 yields the equality \(\Vert {\dot{z}}(t)\Vert = \mathrm{dist}(0;G(z(t)))\) for a.e. \(t\ge 0\), and consequently Eq. (6.6) follows.
To complete the proof, we must only show that property 2 of Assumption F holds. To this end, suppose that z(0) is not composite critical and let \(T>0\) be arbitrary. Appealing to (6.6), clearly \(\sup _{t\in [0,T]} \varphi (z(t))\le \varphi (z(0))\). Thus we must only argue \(\varphi (z(T))<\varphi (z(0))\). According to (6.6), if this were not the case, then we would deduce \(\mathrm{dist}(0;G(z(t)))=0\) for a.e. \(t\in [0,T]\). Appealing to the equality \(\Vert \dot{z}\Vert =\mathrm{dist}(0;G(z(t)))\), we therefore conclude \(\Vert {\dot{z}}\Vert =0\) for a.e. \(t\in [0,T]\). Since \(z(\cdot )\) is absolutely continuous, it must therefore be constant \(z(\cdot )\equiv z(0)\), but this is a contradiction since \(0\notin G(z(0))\). Thus property 2 of Assumption F holds, as claimed. \(\square \)
Proof of Corollary 6.4
The result follows immediately from Theorem 6.2, once we show that Assumption F holds. Since f and g are definable in an o-minimal structure, Theorem 5.8 implies that f and g admit the chain rule. The same argument as in Theorem 5.8 moreover implies that \({\mathcal {X}}\) admits the chain rule as well. Therefore, Lemma 6.3 guarantees that the descent property of Assumption F holds. Thus we must only argue the weak Sard property of Assumption F. To this end, since f, g, and \({\mathcal {X}}\) are definable in an o-minimal structure, there exist Whitney \(C^{d}\)-stratifications \({\mathcal {A}}_f\), \({\mathcal {A}}_g\), and \({\mathcal {A}}_{{\mathcal {X}}}\) of \(\mathrm{gph}\,f\), \(\mathrm{gph}\,g\), and \({\mathcal {X}}\), respectively. Let \(\Pi {\mathcal {A}}_f\) and \(\Pi {\mathcal {A}}_g\) be the Whitney stratifications of \({\mathbb {R}}^d\) obtained by applying the coordinate projection \((x,r)\mapsto x\) to each stratum in \({\mathcal {A}}_f\) and \({\mathcal {A}}_g\). Appealing to [39, Theorem 4.8], we obtain a Whitney \(C^d\)-stratification \({\mathcal {A}}\) of \({\mathbb {R}}^d\) that is compatible with \((\Pi {\mathcal {A}}_f,\Pi {\mathcal {A}}_g, {\mathcal {A}}_{\mathcal {X}})\). That is, for every stratum \(M\in {\mathcal {A}}\) and every stratum \(L\in \Pi {\mathcal {A}}_f\cup \Pi {\mathcal {A}}_g\cup {\mathcal {A}}_{\mathcal {X}}\), either \(M\cap L=\emptyset \) or \(M\subseteq L\).
Consider an arbitrary stratum \(M\in {\mathcal {A}}\) intersecting \({\mathcal {X}}\) (and therefore contained in \({\mathcal {X}}\)) and a point \(x\in M\). Consider now the (unique) strata \(M_f\in \Pi {\mathcal {A}}_{f}\), \(M_g\in \Pi {\mathcal {A}}_{g}\), and \(M_{{\mathcal {X}}}\in {\mathcal {A}}_{\mathcal {X}}\) containing x. Let \({\widehat{f}}\) and \({\widehat{g}}\) be \(C^d\)-smooth functions agreeing with f and g on a neighborhood of x in \( M_f\) and \( M_g\), respectively. Appealing to (5.5), we conclude
The Whitney condition in turn directly implies \(N_{{\mathcal {X}}}(x)\subset N_{M_{{\mathcal {X}}}}(x).\) Hence summing yields
where the last inclusion follows from the compatibility \(M\subset M_f\) and \(M\subset M_g\). Notice that \({{\widehat{f}}}+{{\widehat{g}}}\) agrees with \(f+g\) on a neighborhood of x in M. Hence, if the inclusion \(0\in \partial f(x)+\partial g(x)+N_{{\mathcal {X}}}(x)\) holds, it must be that x is a critical point of the \(C^d\)-smooth function \(f+g\) restricted to M, in the classical sense. Applying the standard Sard’s theorem to each manifold M, the result follows. \(\square \)
Keywords
- Subgradient
- Proximal
- Stochastic subgradient method
- Differential inclusion
- Lyapunov function
- Semialgebraic
- Tame