In this section, we will study the convergence of the proposed algorithm. Due to the randomly chosen evaluation point within the algorithm, we will have to study probabilistic convergence behavior in terms of “almost sure convergence” and convergence in expectation. This notion of convergence as well as further assumptions and definitions are given in Section 3.1. In Section 3.2, we prove that the error in the gradient approximation goes to zero and finally apply this result in Section 3.3 to prove convergence of the CSG method.
Assumptions, definitions, and preliminary results
For the convergence analysis of Algorithm 1, the following assumptions on the objective functional, the step length, and the sets Uad and Vad are essential ingredients. In the following, we assume that these assumptions are always satisfied without mentioning this explicitly.
Definition 4 (Lipschitz constants and maxima)
We denote the Lipschitz constants of \(D_J\) and \(\nabla_u f\) with respect to their first and second arguments by \(L_{D_J}^{(1)}, L_{D_J}^{(2)}\) and \(L_{\nabla_u f}^{(1)}, L_{\nabla_u f}^{(2)}\), respectively. Their maximal absolute function values are denoted by \(C_{D_J}\) and \(C_{\nabla_u f}\).
For Uad and Vad, we assume the following:
Assumption 5 (Regularity of Uad and Vad)
The set \({U_{\text{ad}}} \subset \mathbb{R}^{d_{u}}\) is compact and convex. \(V_{\text{ad}} \subset \mathbb{R}^{d_{v}}\) is open and bounded. In addition, there exists \(c\in \mathbb{R}\) s.t. \(\left|V_{\text{ad}} \setminus V_{\text{ad}}^{r}\right| \leq c r \ \forall r\in (0,1)\), with \(V_{\text{ad}}^{r} := \{x\in V_{\text{ad}} : B_{r}(x) \subset V_{\text{ad}}\}\), where \(B_{r}(x) \subset \mathbb{R}^{d_{v}}\) is the open ball of radius r centered at \(x\in \mathbb{R}^{d_{v}}\).
The latter assumption is fulfilled for non-pathological open sets with finite perimeter.
Assumption 6 (Step length)
The step length \((\tau _{n})_{n \in \mathbb {N}}\) satisfies the following: \(\exists N \in \mathbb {N}\), \(c_{1},c_{2}\in \mathbb {R}_{>0}\), and \(\delta \in \left (0,\frac {1}{\max \limits \{d_{v},2\}}\right )\) s.t.
$$ c_{1} n^{-1}\leq \tau_{n} \leq c_{2} { n^{-1+\frac{1}{\max\{d_{v},2\}}-\delta}} \quad \forall n \in \mathbb{N}_{>N}. $$
These conditions on the step length imply the conditions stated in Robbins and Monro (1951, Eqns. (6) and (26)) as well as, equivalently, in Bottou et al. (2018, Eqn. (4.19)) in the one-dimensional case, and can be seen as a higher-dimensional counterpart.
Remark 7 (Step length for dv = 1 and dv = 2)
In case of a one- or two-dimensional set Vad, Assumption 6 is satisfied iff
$$ \tau_{n} \in \mathcal{O}(n^{-1})\cap o(n^{-\frac{1}{2}}) $$
with the Big Oh and Little Oh notation as defined in Bürgisser and Cucker (2013). In other words, the null sequence \((\tau _{n})_{n\in \mathbb {N}}\) must not tend to zero faster than \((n^{-1})_{n\in \mathbb {N}}\), but it must tend to zero faster than \((n^{-\frac {1}{2}})_{n\in \mathbb {N}}\).
The lower bound on the step sizes ensures that the accumulated step sizes diverge, so that the algorithm does not stall merely because the steps become too small. The upper bound ensures that the approximation of the search direction is sufficiently accurate; it is linked to the rate of convergence of empirical measures, see, e.g., Dudley (1969, Prop. 3.4).
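For illustration only, the following Python sketch (not part of the analysis; the values of d_v, c, and δ are arbitrary admissible choices) generates a step length sequence of the form permitted by Assumption 6 and numerically shows the two effects discussed above: the partial sums of τ_n keep growing, while the partial sums of τ_n² remain bounded.

```python
import numpy as np

# Illustrative constants; Assumption 6 admits any c > 0 and 0 < delta < 1/max(d_v, 2).
d_v, c, delta = 2, 1.0, 0.1
exponent = -1.0 + 1.0 / max(d_v, 2) - delta        # here: -0.6

n = np.arange(1, 200_001)
tau = c * n**exponent                              # tau_n = c * n^(-1 + 1/max(d_v,2) - delta)

# Sum of tau_n diverges (exponent > -1), the Robbins-Monro-type condition ...
print("partial sums of tau_n:  ", tau.cumsum()[[999, 99_999, 199_999]])
# ... while sum of tau_n^2 stays bounded, since 2*(1 - 1/max(d_v,2) + delta) > 1.
print("partial sums of tau_n^2:", (tau**2).cumsum()[[999, 99_999, 199_999]])
```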
Notwithstanding these assumptions on the step size, a result for the case of a step length bounded away from zero will be stated in Theorem 21.
To show convergence of the algorithm, we must first state the probability space setting.
Definition 8 (Probability space setup)
The probability space \(({\varOmega },\mathcal {A},\mathbb {P})\) is given by the following:
$$ \begin{array}{@{}rcl@{}} {\varOmega} &:=& {V}_{\text{ad}}^{\mathbb{N}}, \quad \mathbb{P} := \mu^{\otimes \mathbb{N}}\\ \mathcal{A} &:=& \sigma \{A_{1}\times \ldots \times A_{n}: A_{i} \in \mathcal{B}(V_{\text{ad}}), \forall i, n \in \mathbb{N}\}, \end{array} $$
where \(\mu ^{\otimes \mathbb {N}} (A_{1}\times {\ldots } \times A_{n}) = {\prod }_{i = 1}^{n} \frac {\mu (A_{i})}{\mu (V_{\text {ad}})} \) is the product measure and \(\mu = \lambda ^{d_{v}}\) the Lebesgue measure in \(\mathbb {R}^{d_{v}}\).
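In practical terms, the product measure simply means that the evaluation points ω1, ω2, … are drawn independently and uniformly from Vad. A minimal sketch of such sampling, assuming for illustration that Vad is an open disc in \(\mathbb{R}^2\) (this particular choice is not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_uniform_disc(n_samples, radius=1.0):
    """Draw i.i.d. points uniformly from an open disc in R^2 (illustrative stand-in for V_ad).

    Rejection sampling from the bounding box: each accepted point has law mu(.)/mu(V_ad),
    so a finite batch corresponds to the first coordinates of one omega in Omega = V_ad^N.
    """
    points = []
    while len(points) < n_samples:
        cand = rng.uniform(-radius, radius, size=2)
        if np.linalg.norm(cand) < radius:          # strict inequality: V_ad is open
            points.append(cand)
    return np.array(points)

omega_head = sample_uniform_disc(1000)             # first 1000 evaluation points omega_1, ..., omega_1000
```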
All the following random variables are defined in this setting. For the convergence of random variables, we use the following commonly used notation:
Definition 9 (Stochastic convergence)
A sequence of random variables \((Z_{n})_{n\in \mathbb {N}}\) converges almost surely to some random variable Z iff
$$ \mathbb{P}\left( \{\omega \in {\varOmega} : \underset{n\to\infty}{\lim}Z_{n}(\omega) = Z(\omega) \}\right) = 1, $$
which we denote by \(Z_{n} \xrightarrow{\text{a.s.}} Z\).
In this document, we define the following norm on the product space Uad × Vad.
Definition 10 (Norm in Uad × Vad)
For better readability, we define the following ℓ1/ℓ2-norm on the product space Uad × Vad:
$$\left\| \binom{u}{v} \right\|_{*} := \|u\|_{2}+\|v\|_{2}.$$
Due to norm equivalence in finite dimensional spaces, the results presented later hold for all chosen norms in Uad and Vad and combinations thereof.
The orthogonal projection used in Algorithm 1 has some important properties:
Lemma 11 (Orthogonal projection)
Let \(S\subset \mathbb {R}^{n}\) for \(n\in \mathbb {N}_{>0}\) be closed and convex and let \(\mathcal {P}_{S}\) be the orthogonal projection, then the following holds for all \(x,y \in \mathbb {R}^{n}\) and z ∈ S:
(a) \((\mathcal {P}_{S}(x) - x)^{T} (\mathcal {P}_{S}(x) - z) \leq 0\),

(b) \( (\mathcal {P}_{S}(y) - \mathcal {P}_{S}(x))^{T}(y-x) \geq \|\mathcal {P}_{S}(y) - \mathcal {P}_{S}(x)\|^{2} \geq 0\),

(c) \(\|\mathcal {P}_{S}(y) - \mathcal {P}_{S}(x) \| \leq \|y-x\|\).
Proof
Property (a) is (ii) in Aubin (2000, Thm. 1.4.1), while (b) and (c) are (iii) and (ii) in Aubin (2000, Prop. 1.4.1), respectively. □
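As a quick numerical sanity check of properties (a)–(c), one may test them for the projection onto a coordinate box, for which the orthogonal projection reduces to componentwise clipping. The box S = [-1,1]^5 and the sample sizes below are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
lo, hi = -1.0, 1.0                      # S = [-1, 1]^5, a closed convex box (illustrative)

def proj(x):
    """Orthogonal projection onto the box S; for a box this is componentwise clipping."""
    return np.clip(x, lo, hi)

for _ in range(1000):
    x, y = rng.normal(size=(2, 5)) * 3.0
    z = rng.uniform(lo, hi, size=5)     # arbitrary point in S
    # (a) the residual x - P(x) forms an obtuse angle with P(x) - z for every z in S
    assert (proj(x) - x) @ (proj(x) - z) <= 1e-12
    # (b) firm nonexpansiveness (monotonicity) of the projection
    assert (proj(y) - proj(x)) @ (y - x) >= np.linalg.norm(proj(y) - proj(x))**2 - 1e-12
    # (c) nonexpansiveness
    assert np.linalg.norm(proj(y) - proj(x)) <= np.linalg.norm(y - x) + 1e-12
```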
For h ∈ C1(Uad) and Uad convex, the following first-order optimality conditions are equivalent:
Corollary 12 (Optimality conditions)
For all u∗∈ Uad, the following items are equivalent:
(i) \(-\nabla h(u^{*})^{T}(u - u^{*}) \leq 0 \quad \forall u \in U_{\text{ad}}\),

(ii) \(\mathcal {P}_{U_{\text{ad}}}(u^{*} -t\nabla h(u^{*})) = u^{*} \quad \forall t \geq 0\).
We call u∗ satisfying one of the above conditions a stationary point.
Proof
Define for u ∈ Uad the cone
$$ N_{{U_{\text{ad}}}}(u) := \{x \in \mathbb{R}^{d_{u}}: x^{T}({\bar{u}}-u) \le 0\ \forall \bar u \in {U_{\text{ad}}}\}. $$
Using Lemma 11 ((a) for “⇒” and (b) for “⇐”), it is straightforward to see that for \(x \in \mathbb {R}^{d_{u}}\) and u ∈ Uad,
$$ {\mathcal{P}}_{{U_{\text{ad}}}} (x) = u \quad \Leftrightarrow \quad (x-u) \in {N}_{{U_{\text{ad}}}}(u). $$
Since \(N_{{U_{\text {ad}}}}(u^{*})\) is a cone, (ii) is equivalent to \(-\nabla h(u^{*}) \in N_{{U_{\text {ad}}}}(u^{*})\), which is precisely condition (i). □
Error in the approximate gradient
In this subsection, we analyze the error between the approximate gradient \(\hat G_{n}\) in the n-th iteration and the gradient of the objective functional \(\nabla F_{n}\). To this end, we define for \(v \in V_{\text{ad}}\), \(\omega \in \Omega\) the sequence of random variables \(\left (X_{n}\right )_{n\in \mathbb {N}}\) by the following:
$$ X_{n}(\omega;v):= \underset{k = 1,\dots, n}{\min}\left\| \binom{u_{k}(\omega)-u_{n}(\omega)}{\omega_{k}-v} \right\|_{*} . $$
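In words, X_n(ω; v) measures the distance, in the ∗-norm of Definition 10, from the pair (u_n, v) to the closest previously visited pair (u_k, ω_k). A minimal sketch of how this quantity could be evaluated from a stored iteration history (the array names u_hist and omega_hist are hypothetical):

```python
import numpy as np

def X_n(u_hist, omega_hist, v):
    """Distance of (u_n, v) to the closest visited pair (u_k, omega_k), k = 1, ..., n,
    measured in the *-norm ||(u, v)||_* = ||u||_2 + ||v||_2 (Definition 10).

    u_hist:     array of shape (n, d_u) with the iterates u_1, ..., u_n
    omega_hist: array of shape (n, d_v) with the sampled points omega_1, ..., omega_n
    v:          point in V_ad at which the gradient integrand is needed
    """
    u_n = u_hist[-1]
    dist = (np.linalg.norm(u_hist - u_n, axis=1)
            + np.linalg.norm(omega_hist - v, axis=1))
    return dist.min()
```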
Lemma 13 (Convergence result)
For v ∈ Vad,
$$ \sum\limits_{n = 1}^\infty \mathbb{P}\left( X_n(\cdot ; v) > \varepsilon_{n} \right) < \infty, $$
where \(\varepsilon _{n} := 2C_{\nabla_{u} f} c_{2} |V_{ad}| n^{-\frac {\delta }{2}} + \tilde {\varepsilon }_{n}\) and \(\tilde {\varepsilon }_{n} := n^{\frac {\delta }{2}-\frac {1}{\max (2,d_{v})}}\), with c2 and δ defined in Assumption 6 and \(C_{\nabla_{u} f}\) in Definition 4. Moreover,
$$ \sum\limits_{n = 1}^{\infty} \sup\limits_{v \in V_{ad}^{\varepsilon_{n}} } \mathbb{P}\left( X_n(\cdot; v) > \varepsilon_{n} \right) < \infty, $$
with \(V_{ad}^{\varepsilon _{n}}\) defined in Assumption 5.
Proof
By item (c) in Lemma 11, we have
$$ \begin{array}{@{}rcl@{}} &&\mathbb{P}(X_{n}(\cdot;v) \geq \varepsilon_{n}) \\ &&\leq \mathbb{P}\left( \underset{k = i_{0} ,\dots, n}{\min}\left\| \left( \begin{array}{c}u_{k}(\omega)-u_{n}(\omega)\\ \omega_{k}-v\end{array}\right)\right\|_{\ast} \geq \varepsilon_{n} \right)\\ &&\leq \mathbb{P}\left( \sum\limits_{i=i_{0}}^{n-1}\|\tau_{i} \hat{G}_{i} \| + \underset{k = i_{0} ,\dots, n}{\min} \|\omega_{k} - v\| \geq \varepsilon_{n} \right) \\ &&\leq \mathbb{P}\left( C_{\nabla_{u} f} c_{2} |V_{ad}|\sum\limits_{i=i_{0}}^{n-1} \frac{1}{i^{\kappa}} + \underset{k = i_{0},\dots, n}{\min} \|\omega_{k} - v\| \geq \varepsilon_{n} \right), \end{array} $$
where \(i_{0} := \lceil n - a_{n} + 1\rceil\) and \(\kappa \in (0,1)\) is given by \(\kappa := 1-\frac {1}{\max \{d_{v},2\}}+\delta \). For \(d_{v} = 1\), we choose \(a_{n} = \sqrt {n}\), and for \(d_{v} \geq 2\), we choose \(a_{n} = n^{1-\frac {\delta }{2}}\). Observe that for n > 2 we obtain
$$ \begin{array}{@{}rcl@{}} &&{}\sum\limits_{i=i_{0}}^{n-1} \frac{1}{i^{\kappa}} \le {\int}_{i_{0}-1}^{n} \frac{1}{s^{\kappa}} \mathrm{d} s = \frac{1}{1-\kappa}\cdot \left( n^{1-\kappa} - (\lceil n-a_{n}\rceil)^{1-\kappa} \right) \\ &&{}\le \frac{1}{1-\kappa} \cdot \left( n^{1-\kappa} - (n - a_{n})^{1-\kappa} \right) = \frac{1}{1 - \kappa} \cdot \left( \frac{n}{n^{\kappa}} - \frac{n - a_{n}}{(n - a_{n})^{\kappa}} \right)\\ &&{}= \frac{1}{1-\kappa}\cdot \frac{n\left( (1-\frac{a_{n}}{n})^{\kappa} - 1\right)}{(n-a_{n})^{\kappa}} + \frac{a_{n}}{(1-\kappa)(n-a_{n})^{\kappa}}\\ &&{}\text{applying Bernoulli's inequality in the first term}\\ &&{}\le \frac{1}{1-\kappa} \cdot \left( \frac{-\kappa a_{n}}{(n - a_{n})^{\kappa}} \right) + \frac{a_{n}}{(1 - \kappa)(n - a_{n})^{\kappa}} = \frac{a_{n}}{n^{\kappa}}\left( 1 - \tfrac{a_{n}}{n}\right)^{-\kappa}. \end{array} $$
As \(\frac {a_{n}}{n} = n^{\frac {\delta }{2}-\frac {1}{\max \{d_{v},2\}}} \leq n^{-\frac {1}{2\max \{d_{v},2\}}}\), we obtain
$$ \sum\limits_{i=i_{0}}^{n-1}\|\tau_{i} \hat{G}_{i} \| \leq \frac{C_{\nabla_{u} f} c_{2} |V_{ad}|}{1- 2^{-\frac{1}{2\max\{d_{v},2\}}}} n^{-\frac{\delta}{2}} = \varepsilon_{n}. $$
(3.1)
For all v ∈ Vad there exists n large enough such that \(B_{\tilde {\varepsilon }_{n}}(v) \subset V_{ad}\). Hence,
$$ \begin{array}{@{}rcl@{}} &&\mathbb{P}(X_{n}(\cdot;v) \geq \varepsilon_{n}) \leq \mathbb{P}\left( \underset{k = i_{0} ,\dots, n-1}{\min} \|\omega_{k} - v\| \geq \tilde \varepsilon_{n} \right) \\ &&= \mathbb{P}\left( \|\omega_{k} - v\| \geq \tilde{\varepsilon}_{n} \quad \forall k\in \left\{i_{0} ,\dots, n-1\right\}\right)\\ &&=\prod\limits_{k=i_0}^{n-1} \mathbb{P}\left( \|\omega_{k} - v\| \geq \tilde \varepsilon_{n} \right) =\prod\limits_{k=i_{0}}^{n-1} \left( 1 - \frac{|B_{\tilde{\varepsilon}_{n}}(v)|}{|V_{ad}|} \right) \\ &&\le\left( 1 - \frac{|B_{\tilde{\varepsilon}_{n}}(v)|}{|V_{ad}|} \right)^{ a_{n}} = \left( 1 - \nu \cdot n^{-\frac{d_{v}}{\max(2,d_{v})}+ \frac{d_{v} \delta}{2}}\right)^{ a_{n}}, \end{array} $$
with \(\nu := \frac {\pi ^{\frac {d_{v}}{2}}}{\Gamma (\frac {d_{v}}{2}+1)|V_{ad}|}\). Thus
$$ \mathbb{P}(X_{n}(\cdot;v) \geq \varepsilon_{n}) \leq \left( 1 - \nu\cdot \frac{n^{\frac{\delta}{2}}}{a_{n}}\right)^{ a_{n}}. $$
By the limit comparison test, the corresponding series converge. Finally, note that Assumption 5 gives that \(v \in V_{ad}^{\varepsilon _{n}}\) implies \(B_{\varepsilon _{n}}(v) \subset V_{ad}\) and therefore \(\displaystyle \sup \limits _{v \in V_{ad}^{\varepsilon _{n}}} \mathbb {P}(X_{n}(\cdot ;v) \geq \varepsilon _{n}) \le \left (1 - \nu \cdot \frac {n^{\frac {\delta }{2}}}{a_{n}}\right )^{ a_{n}}\).
□
As a direct consequence of the latter result, we obtain almost sure convergence:
Corollary 14 (Density of ω in Vad)
For all v ∈ Vad
$$ X_{n}(\cdot ; v) \xrightarrow{\text{a.s.}} 0 \quad\text{for}\quad n\to\infty. $$
Proof
The result follows from Lemma 13 and the Borel–Cantelli lemma, see, e.g., Klenke (2013, Thm. 6.12). □
Thus, due to the Lipschitz continuity of ∇uf and DJ, the integral in ∇F(un) is increasingly better approximated by \(\hat G_{n}\) for \(n \rightarrow \infty \):
Corollary 15 (Error in gradient approximation)
The norm of the difference between the approximate gradient \(\hat G_{n}\) in the n-th iteration (defined in Algorithm 1) and the gradient of the exact objective functional ∇F at un goes to zero, i.e.,
$$ \| \hat{G}_{n} - \nabla F_{n}\| \xrightarrow{\text{a.s.}} 0, \quad \underset{n\to\infty}{\lim} \mathbb{E}\left[\|\hat{G}_{n} - \nabla F_{n}\| \right] = 0 $$
and
$$ \sum\limits_{n = 0}^{\infty} \tau_{n} \mathbb{E}\left[ \|\hat{G}_{n} - \nabla F_{n} \|\right] < \infty. $$
(3.2)
Proof
For v ∈ Vad, define
$$ k^{n}(v) := \underset{{k = 1,\ldots,n}}{\arg \min} \left\{\left\|\binom{u_{k}(\omega)-u_{n}(\omega)}{\omega_{k}-v}\right\|_{*} \right\}. $$
Then,
$$ \hat G_{n} = {\int}_{V_{\text{ad}}} D_{ J}(\hat f(u_{n},\cdot),\omega_{k^{n}(v)})\nabla_{u} \hat f(u_{n},v) \mathrm{d} v, $$
where \(\hat f(u_{n},v) := f(u_{k^{n}(v)},\omega _{k^{n}(v)})\). By the Lipschitz continuity assumed in Assumption 1, we therefore obtain the following:
$$ \begin{array}{@{}rcl@{}} \| \hat{G}_{n} - \nabla F_{n}\| &\leq& \left( C_{\nabla_{u} f} L_{D_{J}}^{(1)}\max\{L_{f}^{(1)},L_{f}^{(2)}\} + C_{\nabla_{u} f} L_{D_{J}}^{(2)}\right.\\ && \left. + C_{D_{J}} \max\{L_{\nabla_{u} f}^{(1)},L_{\nabla_{u} f}^{(2)}\}\right) {\int}_{V_{\text{ad}}} X_{n}(\omega;v) \mathrm{d} v, \end{array} $$
with the constants defined in Definition 4 and Xn(⋅;v) as defined in Lemma 13. Recall that Uad and Vad are bounded. Now, the almost sure convergence, as well as the convergence of the expectations, follows by Lebesgue’s dominated convergence theorem.
Finally, let C be a generic constant and ε > 0. Since \(\sup _{v \in V_{\text {ad}}} X_{n}(\cdot ;v) \le D < \infty \) with D := diam(Vad) + diam(Uad), and by Fubini’s theorem, we have the following:
$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[\|\hat{G}_{n} - \nabla F_{n} \|\right]\leq C \mathbb{E}\left[{\int}_{V_{\text{ad}}} X_{n}(\cdot;v) \mathrm{d}v \right] \\ &\leq& C {\int}_{V_{\text{ad}}} \varepsilon \mathbb{P}(X_{n}(\cdot;v) \leq \varepsilon) + 2 D \mathbb{P}(X_{n}(\cdot;v) > \varepsilon) \mathrm{d}v \\ &\leq& C \left( \varepsilon + {\int}_{V_{\text{ad}}\setminus V_{\text{ad}}^{\varepsilon}} \mathbb{P}(X_{n}(\cdot;v) > \varepsilon) \mathrm{d}v + {\int}_{V_{\text{ad}}^{\varepsilon}} \mathbb{P}(X_{n}(\cdot;v) > \varepsilon) \mathrm{d}v\right)\\ &\le& C \left( 2\varepsilon + \underset{v \in V_{\text{ad}}^{\varepsilon}}{\sup}\mathbb{P}(X_{n}(\cdot;v) > \varepsilon)\right), \end{array} $$
where \({V}_{\text {ad}}^{r}\) is given in Assumption 5. If we choose \(\varepsilon = \varepsilon _{n} = 2 C_{\nabla_{u} f} c_{2}\cdot n^{-\frac {1}{2}} + n^{-\frac {1}{\max (2,d_{v})}+\frac {\delta }{2}}\) as in Lemma 13, we obtain the following:
$$ \begin{array}{@{}rcl@{}} &&\sum\limits_{n = 1}^{\infty} \tau_{n} \mathbb{E}\left[\|\hat{G}_{n} - \nabla F_{n} \|\right] \\ &&\le C \left( \sum\limits_{n = 1}^{\infty} \frac{1}{n^{1+\delta}} + \frac{1}{n^{1 + \frac{\delta}{2}}} + \underset{v \in {V}_{\text{ad}}^{\varepsilon_{n}}}{\sup} \mathbb{P}(X_{n}(\cdot;v) > \varepsilon_{n}) \right) < \infty, \end{array} $$
which concludes the proof. □
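For readers who prefer a computational view, the nearest-neighbor representation of \(\hat G_{n}\) used in the proof above can be approximated by quadrature over Vad. The sketch below is only illustrative: the callable dJ, the quadrature nodes and weights, and the history arrays are hypothetical interfaces, and D_J is treated, for simplicity, as a function of the stored value f(u_k, ω_k) and the integration point; this is not the authors' implementation of Algorithm 1.

```python
import numpy as np

def csg_search_direction(u_hist, omega_hist, f_hist, gradf_hist, dJ, v_nodes, v_weights):
    """Sketch of the nearest-neighbor reconstruction behind G_hat_n (see the proof above).

    u_hist, omega_hist : iterate and sample histories, shapes (n, d_u) and (n, d_v)
    f_hist, gradf_hist : stored values f(u_k, omega_k) and grad_u f(u_k, omega_k)
    dJ                 : callable (f_value, v) -> scalar; simplified stand-in for D_J
    v_nodes, v_weights : quadrature rule discretizing the integral over V_ad
    """
    u_n = u_hist[-1]
    G_hat = np.zeros(gradf_hist.shape[1])
    for v, w in zip(v_nodes, v_weights):
        # k^n(v): index of the visited pair (u_k, omega_k) closest to (u_n, v) in the *-norm
        dist = (np.linalg.norm(u_hist - u_n, axis=1)
                + np.linalg.norm(omega_hist - v, axis=1))
        k = int(dist.argmin())
        # accumulate the quadrature contribution of node v using the stored data at index k
        G_hat += w * dJ(f_hist[k], v) * gradf_hist[k]
    return G_hat
```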
Convergence result
As we have seen in Corollary 15, the error \(\|\hat {G}_{n} - \nabla F_{n}\| \) converges almost surely and in expectation to zero for \(n\rightarrow \infty \). It remains to provide sufficient conditions under which the algorithm converges to a stationary point.
Lemma 16 (Objective functional values)
The difference of the objective functional values in iteration \(n\in \mathbb {N}\) can be estimated as follows:
$$ F_{n+1} - F_{n} \leq -\frac{1}{\tau_{n}} \|u_{n+1} - u_{n}\|^{2} + \phi_{n}, $$
with \(\phi _{n} := \tau _{n} \|\nabla F_{n}-\hat {G}_{n} \| \cdot \|\hat {G}_{n}\| + {\tau _{n}^{2}} C\|\hat {G}_{n}\|^{2}\), where \(C \in \mathbb {R}_{\geq 0}\) is a constant depending on the Lipschitz constants and suprema of the involved functions.
Proof
By the mean value theorem, there is a c ∈ (0,1) such that (we set \(\nabla {{F}_{n}^{c}} := \nabla F((1-c) u_{n} + c u_{n+1})\))
$$ \begin{array}{@{}rcl@{}} F_{n+1} - F_{n} &=& (\nabla {{F}_{n}^{c}})^{T}\left( u_{n+1}- u_{n}\right) \\ &=& \nabla {{F}_{n}^{T}}\left( u_{n+1}- u_{n}\right) + (\nabla {{F}_{n}^{c}} - \nabla F_{n})^{T}\left( u_{n+1}- u_{n}\right) \\ &\leq& \nabla {{F}_{n}^{T}}\left( u_{n+1}- u_{n}\right) + C \|u_{n+1}-u_{n}\|^{2}, \end{array} $$
using the Cauchy–Schwarz inequality and Definition 4. Recall that \(u_{n+1} = \mathcal{P}_{U_{\text {ad}}}(u_{n} - \tau _{n} \hat {G_{n}})\). With Lemma 11 (b), (c), and the Cauchy–Schwarz inequality applied to the first term on the right-hand side of the latter estimate, we obtain the following:
$$ \begin{array}{@{}rcl@{}} &&\nabla {{F}_{n}^{T}} \left( \mathcal{P}_{U_{\text{ad}}}(u_{n} - \tau_{n} \hat{G_{n}}) - u_{n}\right) \\ &=& \hat {G_{n}^{T}}\left( \mathcal{P}_{U_{\text{ad}}}(u_{n} - \tau_{n} \hat{G_{n}})- u_{n}\right) \\ &&+ (\nabla F_{n} - \hat G_{n})^{T}\left( \mathcal{P}_{U_{ad}}(u_{n} - \tau_{n} \hat{G_{n}})- u_{n}\right)\\ &\leq& -\frac{1}{\tau_{n}}(u_{n} - \tau_{n}\hat G_{n} -u_{n})^{T} \cdot \left( \mathcal{P}_{U_{ad}}(u_{n} - \tau_{n} \hat{G_{n}})- u_{n}\right)\\ &&+ \|\nabla F_{n} - \hat G_{n}\|\|\mathcal{P}_{U_{ad}}(u_{n} - \tau_{n} \hat{G_{n}}) - u_{n}\|\\ &\leq& -\frac{1}{\tau_{n}}\|u_{n+1} - u_{n}\|^{2} + \tau_{n} \|\nabla F_{n} - \hat G_{n}\| \|\hat G_{n}\|. \end{array} $$
Applying Lemma 11 (c) to the second term yields \( \|\mathcal {P}_{U_{\text {ad}}}(u_{n} - \tau _{n} \hat {G_{n}}) - u_{n}\|^{2}\leq {\tau _{n}^{2}} \|\hat G_{n}\|^{2}. \)□
Since the first term on the right-hand side of the estimate in the above lemma is nonpositive while the second term is nonnegative, we may expect a descent, provided \(\mathbb {E}\left [\phi _{n}\right ]\) is small enough.
Corollary 17 (Convergence result)
We have the following:
$$\sum\limits_{n = 0}^{\infty} \mathbb{E}\left[ \phi_{n}\right] < \infty,$$
where \(\phi _{n} := \tau _{n} \|\nabla F_{n}-\hat {G}_{n} \| \|\hat {G}_{n}\| + {\tau _{n}^{2}}\frac {C_{\nabla _{u} f}}{2}\|\hat {G}_{n}\|^{2}.\)
Proof
Since \(\|\hat G_{n}\|\) is bounded and \(\sum \limits _{n = 1}^{\infty } {\tau _{n}^{2}} < \infty \) by Assumption 6, the result follows by (3.2). □
Before we present our main results, we need the following auxiliary result:
Lemma 18 (Projection of gradient steps)
$$ \|\mathcal{P}_{{U_{\text{ad}}}}(u_{n} - t \hat G_{n}) -u_{n}\| \leq \frac{t}{\tau_{n}}\|u_{n+1}-u_{n}\|\quad \forall t \geq 0. $$
Proof
Define \(x:=u_{n},y:=\hat G_{n}\), and τ := τn. We assume that x − τy∉Uad (otherwise the result follows by Lemma 11) and set \(n_{\tau }:= x-\tau y - \mathcal {P}_{U_{\text {ad}}}(x-\tau y)\) and
$$ H:=\{u\in\mathbb{R}^{d_{u}} \ | \ u^{T} n_{\tau} \leq \mathcal{P}_{U_{\text{ad}}}(x-\tau y)^{T} n_{\tau} \}. $$
Since Uad is convex, we have Uad ⊂ H, and therefore \(\forall u\in \mathbb {R}^{d_{u}}\) by Lemma 11,
$$ \|u-x\| \geq \|\mathcal{P}_{H}(u) - x\|\geq \|\mathcal{P}_{U_{\text{ad}}}(u) - x\|, $$
(3.3)
where \(\mathcal {P}_{H}\) is the orthogonal projection onto H (compare Fig. 3). With \(B:= \frac {t}{\tau }\left (\mathcal {P}_{U_{\text {ad}}}(x-\tau y)-x\right ) + x\) and (3.3), we have the following:
$$ \begin{array}{@{}rcl@{}} &&\frac{t}{\tau}\|\mathcal{P}_{U_{\text{ad}}}(x-\tau y) - x\| \geq \|\mathcal{P}_{H}(B)-x\|\\ &=& \|\mathcal{P}_{H}(x-ty)-x\| \geq \|\mathcal{P}_{U_{\text{ad}}}(x-ty)-x\|. \end{array} $$
□
Recalling the characterization of stationary points from Corollary 12, we obtain our first main result:
Theorem 19 (Convergence result)
For all t ≥ 0,
$$ \sum\limits_{n=0}^{\infty} \tau_{n} \mathbb{E}\left[\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n}) -u_{n}\|^{2}\right] < \infty. $$
Proof
First, note that by the compactness of Uad and regularity of F, \(F_{\inf } := \inf _{u \in {U_{\text {ad}}}} F(u) > -\infty \). Taking the expectation and summing both sides of the inequality in Lemma 16 up to an \(N\in \mathbb {N}\) gives the following:
$$ \begin{array}{@{}rcl@{}} F_{\inf } - F_{0} &\leq& \mathbb{E}\left[F_{N+1}\right] - F_{0} = \sum\limits_{n = 0}^{N} \mathbb{E}\left[F_{n+1} - F_{n}\right]\\ &\leq& \sum\limits_{n = 0}^{N}\left( -\frac{1}{\tau_{n}} \mathbb{E}\left[ \|u_{n+1} - u_{n}\|^{2} \right] + \mathbb{E}\left[\phi_{n}\right] \right). \end{array} $$
Hence, by Corollary 17,
$$ \sum\limits_{n = 0}^{\infty} \frac{1}{\tau_{n}} \mathbb{E}\left[ \|u_{n+1} - u_{n}\|^{2} \right] \leq F_{0} - F_{\inf } + {\sum}_{n = 0}^{\infty} \mathbb{E}\left[\phi_{n}\right]< \infty. $$
Using Lemma 11 (c) together with Lemma 18 gives for all \(n \in \mathbb {N}\) sufficiently large,
$$ \begin{array}{@{}rcl@{}} && \|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n}) -u_{n}\|^{2} \\ &\leq& \left( \|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \hat G_{n}) -u_{n}\|\right. \\ &&+ \left.\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n}) - \mathcal{P}_{U_{\text{ad}}}(u_{n} - t \hat G_{n})\|\right)^{2} \\ &\leq& \left( \frac{t}{\tau_{n}}\|u_{n+1} -u_{n}\| + t\|\hat G_{n} - \nabla F_{n}\|\right)^{2}. \end{array} $$
Since \(\|u_{n+1} - u_{n}\| \le \tau _{n} \| \hat G_{n}\|\), this combined with Corollary 15 gives the result. □
As a direct consequence, we have the following:
Theorem 20 (Main theorem)
Let \((u_{n})_{n\in \mathbb {N}}\) be generated by Algorithm 1. Then, there exists a sub-sequence \((u_{n_{k}})_{k\in \mathbb {N}}\) converging to a stationary point, i.e.,
$$ \underset{n\to\infty}{\liminf}\ {\mathbb{E}}\left[\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n}) -u_{n}\|^{2}\right] = 0 \quad \forall t >0. $$
Proof
Direct consequence of Theorem 19. □
For applications, the condition on the step length (Assumption 6) is inconvenient, since the step length becomes very small and the algorithm thus progresses only slowly. If the algorithm is performed with a step length bounded away from zero, and if the sequence \((u_{n})_{n\in \mathbb {N}}\) converges to some u∗∈ Uad, then the following theorem shows that u∗ is a stationary point.
Theorem 21 (Convergent series)
Assume the step length sequence \((\tau _{n})_{n\in \mathbb {N}}\) satisfies \(\tau _{n} \geq \tau \) for all \(n \in \mathbb {N}\) and some τ > 0. Let further \((v_{n})_{n\in \mathbb {N}}\) be dense in Vad and assume \((u_{n})_{n\in \mathbb {N}}\) converges to u∗∈ Uad. Then, u∗ is a stationary point of F, i.e.,
$$ \mathbb{E}\left[\|\mathcal{P}_{U_{\text{ad}}}(u^{*} - t \nabla F(u^{*})) -u^{*}\|^{2}\right] = 0 \quad \forall t > 0. $$
Proof
Similar to Corollary 15, \(\|\hat G_{n} - \nabla F_{n}\| \rightarrow 0\). Thus, by convergence of \((u_{n})_{n\in \mathbb {N}}\), we obtain the following:
$$ \begin{array}{@{}rcl@{}} &&\|\mathcal{P}_{U_{\text{ad}}}(u^{*} - t \nabla F(u^{*})) - u^{*}\| \\ &=& \underset{n\to \infty}{\lim} \|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n}) - u_{n}\| \\ &\leq& \underset{n\to \infty}{\lim} \left( \|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \hat G_{n}) - u_{n}\|\right.\\ &&+ \left.\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \hat G_{n}) - \mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n})\|\right)\\ &\leq& \underset{n\to \infty}{\lim} \left( \frac{t}{\tau_{n}}\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - \tau_{n} \hat G_{n}) - u_{n}\| \right.\\ &&+ \left.\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \hat G_{n}) - \mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n})\|\right)\\ &\leq& \underset{n\to \infty}{\lim} \frac{t}{\tau}\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - \tau_{n} \hat G_{n}) - u_{n}\| + \underset{n\to \infty}{\lim} t \| \hat G_{n} - \nabla F_{n}\|\\ \end{array} $$
$$ \begin{array}{@{}rcl@{}} &=& \frac{t}{\tau} \underset{n\to \infty}{\lim} \|u_{n+1} -u_{n}\| + t\underset{n\to \infty}{\lim} \| \hat G_{n} - \nabla F_{n}\| = 0 \end{array} $$
by Lemma 18 and Lemma 11 (c). □
Note that almost all sequences \((v_{n})_{n\in \mathbb {N}}\) that are given by the random number generator in Algorithm 1 are dense in Vad. In addition to the convergence properties shown in the latter theorems, the algorithm also approximates the objective functional value with arbitrary accuracy:
Corollary 22 (Approximation of F)
Let the sequence \((u_{n})_{n\in \mathbb {N}}\) be generated by Algorithm 1. Then, for all convergent subsequences \((u_{n_{k}})_{k\in \mathbb {N}}\) with \(u_{n_{k}} \rightarrow u^{*}\), we obtain for \(\hat F\) as defined in Algorithm 1 that \( |\hat F_{n_{k}} - F(u^{*})| \stackrel {\text {a.s.}}{\rightarrow } 0 \). If, in addition, \((v_{n_{k}})_{k\in \mathbb {N}}\) is dense in Vad, we obtain \( \lim _{k\rightarrow \infty } \hat F_{n_{k}} = F(u^{*}). \)
Proof
The proof is similar to the proof of Lemma 13, relying on the Lipschitz continuity of F (a direct consequence of Assumption 1) and on Corollary 14. □
Remark 23 (Termination condition)
By the regularity assumption on the objective functional in Assumption 1, the termination condition in Algorithm 1 can be posed as follows:
$$ \|\mathcal{P}_{{U_{\text{ad}}}}\left( u_{n} - \hat G_{n}\right) - u_{n} \| \leq \epsilon$$
for 𝜖 > 0. This is obviously not possible for SG. To satisfy such a condition with the SAG method, the discretization of the objective functional has to be sufficiently fine in order to approximate the gradient with sufficient accuracy. However, depending on the particular example, an a priori discretization satisfying this property can be hard to determine.
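A minimal sketch of this termination test, assuming a box-constrained Uad as an illustrative choice (the projection proj_U, the current search direction, and the tolerance eps are the only ingredients needed):

```python
import numpy as np

def is_stationary(u_n, G_hat_n, proj_U, eps):
    """Termination test from Remark 23: the projected gradient step barely moves u_n."""
    return np.linalg.norm(proj_U(u_n - G_hat_n) - u_n) <= eps

# Illustrative use with U_ad = [0, 1]^{d_u} (hypothetical admissible set):
proj_box = lambda u: np.clip(u, 0.0, 1.0)
print(is_stationary(np.array([0.2, 0.9]), np.array([1e-4, -1e-4]), proj_box, 1e-3))
```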