1 Introduction

We consider convex optimization problems of the form

$$\begin{aligned} w^* = {\mathrm{arg min}}_{w \in H} F(w), \end{aligned}$$
(1)

where H is a real Hilbert space and

$$\begin{aligned} F(w) = \mathbf {E}_{\xi } [ f(w, \xi ) ]. \end{aligned}$$

The main applications we have in mind are supervised learning tasks. In such a problem, a set of data samples \(\{x_j\}_{j=1}^{n}\) with corresponding labels \(\{y_j\}_{j=1}^{n}\) is given, as well as a classifier h depending on the parameters w. The goal is to find w such that \(h(w,x_j) \approx y_j\) for all \(j \in \{1,\dots ,n\}\). This is done by minimizing

$$\begin{aligned} F(w) = \frac{1}{n} \sum _{j = 1}^{n}{\ell (h(w, x_j), y_j)}, \end{aligned}$$
(2)

where \(\ell\) is a given loss function. We refer to, e.g., Bottou et al. [9] for an overview. In order to reduce the computational costs, it has proved useful to split F into a collection of functions f of the form

$$\begin{aligned} f(w,\xi ) = \frac{1}{|B_{\xi } |} \sum _{j \in B_{\xi } } {\ell (h(w, x_j), y_j)}, \end{aligned}$$

where \(B_{\xi }\) is a random subset of \(\{1,\dots ,n\}\), referred to as a batch. In particular, the case of \(|B_{\xi } |= 1\) is interesting for applications, as it corresponds to a separation of the data into single samples.

A commonly used method for such problems is the stochastic gradient method (SGD), given by the iteration

$$\begin{aligned} w^{k+1} = w^k - \alpha _k \nabla f(w^k, \xi ^k), \end{aligned}$$

where \(\alpha _k >0\) denotes a step size, \(\{\xi ^k\}_{k \in \mathbb {N}}\) is a family of jointly independent random variables and \(\nabla\) denotes the Gâteaux derivative with respect to the first variable. The idea is that in each step we choose a random part \(f(\cdot , \xi )\) of F and take a step in the direction of the negative gradient of this function. SGD corresponds to a stochastic version of the explicit (forward) Euler scheme applied to the gradient flow

$$\begin{aligned} \dot{w} = - \nabla F(w). \end{aligned}$$

This differential equation is frequently stiff, which means that the method often suffers from stability issues.
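To make the stability issue concrete, consider the following minimal sketch (our own illustration, not taken from the paper): for the quadratic \(F(w) = \frac{\lambda }{2} w^2\) with large \(\lambda\), the gradient flow is \(\dot{w} = -\lambda w\), the explicit Euler factor \(1 - \alpha \lambda\) exceeds one in modulus as soon as \(\alpha > \nicefrac {2}{\lambda }\), while the implicit Euler factor \((1 + \alpha \lambda )^{-1}\) is smaller than one for every \(\alpha > 0\).

```python
# Explicit vs. implicit Euler on the stiff scalar gradient flow  w' = -lam * w.
lam, alpha = 50.0, 0.1                     # alpha * lam = 5 > 2: explicit Euler is unstable
w_explicit, w_implicit = 1.0, 1.0
for _ in range(20):
    w_explicit *= (1.0 - alpha * lam)      # amplification factor -4: oscillating blow-up
    w_implicit /= (1.0 + alpha * lam)      # amplification factor 1/6: monotone decay
print(w_explicit, w_implicit)              # ~1.1e12 versus ~2.7e-16
```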

The restatement of the problem as a gradient flow suggests that we could avoid such stability problems by instead considering a stochastic version of implicit (backward) Euler, given by

$$\begin{aligned} w^{k+1} = w^k - \alpha _k \nabla f(w^{k+1}, \xi ^k). \end{aligned}$$

In the deterministic setting, this method has a long history under the name proximal point method, because it is equivalent to

$$\begin{aligned} w^{k+1} = {\mathrm{arg min}}_{w \in H} \left\{ \alpha F(w) + \frac{1}{2} \Vert w - w^k\Vert ^2 \right\} = \text {prox}_{\alpha F}(w^k), \end{aligned}$$

where

$$\begin{aligned} \text {prox}_{\alpha F}(w^k) = (I + \alpha \nabla F)^{-1} w^k. \end{aligned}$$

The proximal point method has been studied extensively in the infinite dimensional but deterministic case, beginning with the work of Rockafellar [28]. Several convergence results and connections to other methods such as the Douglas–Rachford splitting are collected in Eckstein and Bertsekas [13], see also Güler [17]. In the strongly convex case, the main convergence analysis idea is to observe that the gradient is strongly monotone. Then the resolvent \((I + \alpha \nabla F)^{-1}\) is a strict contraction, and the Banach fixed point theorem shows that \(\{w^k\}_{k \in \mathbb {N}}\) converges to \(w^*\) in norm.

Following Ryu and Boyd [32], we will refer to the stochastic version as stochastic proximal iteration (SPI). We note that the computational cost of one SPI step is in general much higher than for SGD, and often prohibitively so. However, in many special cases a clever reformulation can result in very similar costs. If so, then SPI should be preferred over SGD, as it will converge more reliably. We provide such an example in Sect. 5.
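To illustrate how such a reformulation can look (a minimal sketch of our own, assuming a linear least-squares loss with batches of size one; the data and names are placeholders, and this is not necessarily the experiment of Sect. 5), the implicit equation of an SPI step can be solved in closed form, so that its cost is essentially the same as that of an SGD step:

```python
import numpy as np

def sgd_step(w, x, y, alpha):
    """Explicit step w - alpha * grad f(w) for f(w) = 0.5 * (x @ w - y)**2."""
    return w - alpha * (x @ w - y) * x

def spi_step(w, x, y, alpha):
    """Implicit step w_new = w - alpha * (x @ w_new - y) * x, solved exactly:
    the scalar residual satisfies r = (x @ w - y) / (1 + alpha * ||x||^2)."""
    r = (x @ w - y) / (1.0 + alpha * (x @ x))
    return w - alpha * r * x

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

w_sgd, w_spi = np.zeros(d), np.zeros(d)
for k in range(1, 2001):
    i = rng.integers(n)
    alpha = 10.0 / k                       # deliberately large initial steps
    w_sgd = sgd_step(w_sgd, X[i], y[i], alpha)
    w_spi = spi_step(w_spi, X[i], y[i], alpha)

print(np.linalg.norm(w_sgd - w_true), np.linalg.norm(w_spi - w_true))
```

The implicit step costs only one additional inner product, yet it is not destabilized by the large initial step sizes that make the early explicit steps blow up.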

The main goal of this paper is to prove sub-linear convergence of the type

$$\begin{aligned} \mathbf {E}\left[ \Vert w^k - w^* \Vert ^2 \right] \le \frac{C}{k} \end{aligned}$$

in an infinite-dimensional setting, i.e. where \(\{w^k\}_{k \in \mathbb {N}}\) and \(w^*\) are elements in a Hilbert space H. As shown in e.g. [1, 26], this is optimal in the sense that we cannot expect a better asymptotic rate even in the finite-dimensional case.

Most previous convergence results in this setting only provide guarantees for convergence, without an explicit error bound. The convergence is usually also in a rather weak norm. This is mainly due to weak assumptions on the involved functions and operators. Overall, little work has been done on SPI in an infinite-dimensional space. A few exceptions are given by Bianchi [7], where maximal monotone operators \(\nabla F :H \rightarrow 2^H\) are considered and weak ergodic convergence and norm convergence are proved. In Rosasco et al. [30], the authors work in an infinite-dimensional setting with an implicit-explicit splitting where \(\nabla F\) is decomposed into a regular and an irregular part. The regular part is treated explicitly but with a stochastic approximation, while the irregular part is used in a deterministic proximal step. They prove both \(\nabla F(w^k) \rightarrow \nabla F(w^*)\) and \(w^k \rightarrow w^*\) in H as \(k \rightarrow \infty\). Without further assumptions, neither of these approaches yields convergence rates.

In the finite-dimensional case, stronger assumptions are typically made, with better convergence guarantees as a result. Nevertheless, for the SPI scheme in particular, we are only aware of the unpublished manuscript [32], which suggests \(\nicefrac {1}{k}\) convergence in \(\mathbb {R}^d\). Based on [32], the implicit method has also been considered in a few other works: In Patrascu and Necoara [24], an SPI method with additional constraints on the domain was studied. A slightly more general setting that includes the SPI has been considered in Davis and Drusvyatskiy [12]. Toulis and Airoldi and Toulis et al. studied such an implicit scheme in [35,36,37]. Finally, very recently and during the preparation of this work, [20] was published, wherein both SGD and proximal methods for composite problems are analyzed in a common framework based on bounded gradients. This is a generalization of the basic setting in a different direction than our work.

When using an implicit scheme, it is essential to solve the arising implicit equation efficiently. This can be impeded by large batches for the stochastic approximation of F. On the other hand, a larger batch improves the accuracy of the approximation of the function. In Toulis et al. [39, 40] and Ryu and Yin [33], a compromise was found by solving several implicit problems on small batches and taking the average of these results. This corresponds to a sum splitting. Furthermore, implicit-explicit splittings can be found in Patrascu and Irofti [23], Ryu and Yin [33], Salim et al. [34], Bianchi and Hachem [8] and Bertsekas [6]. A few more related schemes have been considered in Asi and Duchi [2, 3] and Toulis et al. [38]. More information about the complexity of solving these kinds of implicit equations and the corresponding implementation can be found in Fagan and Iyengar [16] and Tran et al. [40].

Our aim is to bridge the gap between the “strong finite-dimensional” and “weak infinite-dimensional” settings, by extending the approach of [32] to the infinite-dimensional case. We also further extend the results by allowing for more general Lipschitz conditions on \(\nabla f(\cdot ,\xi )\), provided that sufficient guarantees can be made on the integrability near the minimum \(w^*\). In particular, we make the less restrictive assumption that for every function \(f(\cdot , \xi )\) and every ball of radius \(R>0\) around the origin there is a Lipschitz constant \(L_{\xi }(R)\) that grows polynomially with R. We also weaken the standard assumption of strong convexity and only demand that the functions are strongly convex for some realizations.

We note that if F is only convex then the minimizer need not be unique, and proving convergence in norm is in general not possible. On the other hand, if every \(f(\cdot , \xi )\) is strongly convex then parts of the analysis can be simplified. The assumptions made in this article are thus situated between these two extremes, where it is still possible to prove convergence results similar to the strongly convex case but under milder assumptions.

These strong convergence results can then be applied to, e.g., the setting where there is an original infinite-dimensional optimization problem which is subsequently discretized into a series of finite-dimensional problems. Given a reasonable discretization, each of those problems will then satisfy the same convergence guarantees.

Our analysis closely follows the finite-dimensional approach [32]. However, several arguments no longer work in the infinite-dimensional case (such as the unit ball being compact, or a linear operator having a minimal eigenvalue) and we fix those. Additionally, we simplify several of the remaining arguments, provide many omitted, but critical, details and extend the results to more general operators.

A brief outline of the paper is as follows. The main assumptions that we make are stated in Sect. 2, as well as the main theorem. Then we prove a number of preliminary results in Sect. 3, before we can tackle the main proof in Sect. 4. In Sect. 5 we describe a numerical experiment that illustrates our results, and then we summarize our findings in Sect. 6.

2 Assumptions and main theorem

Let \((\Omega , \mathcal {F}, \mathbf {P})\) be a complete probability space and let \(\{\xi ^k\}_{k \in \mathbb {N}}\) be a family of jointly independent random variables on \(\Omega\). Each realization of \(\xi ^k\) corresponds to a different batch. Let \((H, ( \cdot , \cdot )_{}, \Vert \cdot \Vert )\) be a real Hilbert space and \((H^*, ( \cdot , \cdot )_{H^*}, \Vert \cdot \Vert _{H^*} )\) its dual. Since H is a Hilbert space, the Riesz representation theorem provides an isometric isomorphism \(\iota :H^* \rightarrow H\) whose inverse \(\iota ^{-1} :H \rightarrow H^*\) is given by \(\iota ^{-1}: u \mapsto ( u , \cdot )_{}\). Furthermore, the dual pairing is denoted by \(\langle u' , u\rangle _{} = u'(u)\) for \(u' \in H^*\) and \(u \in H\). It satisfies

$$\begin{aligned} \langle \iota ^{-1} u , v\rangle _{} = ( u , v )_{} \quad \text {and} \quad \langle u' , v\rangle _{} = ( \iota u' , v )_{}, \quad u,v \in H, u' \in H^*. \end{aligned}$$

We denote the space of linear bounded operators mapping H into H by \(\mathcal {L}(H)\). For a symmetric operator S, we say that it is positive if \(( Su , u )_{} \ge 0\) for all \(u \in H\). It is called strictly positive if \(( Su , u )_{} > 0\) for all \(u \in H\) such that \(u \ne 0\).

For the function \(f(\cdot , \xi ) :H \times \Omega \rightarrow (-\infty , \infty ]\), we use \(\nabla\), as in \(\nabla f(u, \xi )\), to denote differentiation with respect to the first variable. When we present an argument that holds almost surely, we will frequently omit \(\xi\) from the notation and simply write f(u) rather than \(f(u, \xi )\). Given a random variable X on \(\Omega\), we denote the expectation with respect to \(\mathbf {P}\) by \(\mathbf {E}[X]\). We use sub-indices, such as in \(\mathbf {E}_{\xi }[\cdot ]\), to denote expectations with respect to the probability distribution of the random variable \(\xi\).

We consider the stochastic proximal iteration (SPI) scheme given by

$$\begin{aligned} w^{k+1} = w^k - \alpha _k \iota \nabla f(w^{k+1}, \xi ^k) \quad \text { in } H, \quad \quad w^1 = w_1 \quad \text { in } H, \end{aligned}$$
(3)

for minimizing

$$\begin{aligned} F(w) = \mathbf {E}_{\xi } [ f(w, \xi ) ], \end{aligned}$$

where f and F fulfill the following assumption.

For the family of jointly independent random variables \(\{\xi ^k\}_{k \in \mathbb {N}}\), we are interested in the total expectation

$$\begin{aligned} \mathbf {E}_{k}\left[ \Vert X\Vert ^2 \right] := \mathbf {E}_{\xi ^1}\left[ \mathbf {E}_{\xi ^2}\left[ \cdots \mathbf {E}_{\xi ^{k}} \left[ \Vert X \Vert ^2 \right] \cdots \right]\right]. \end{aligned}$$

Since the random variables \(\{\xi ^k\}_{k \in \mathbb {N}}\) are jointly independent, and \(w^k\) only depends on \(\xi ^j\), \(j \le k-1\), this expectation coincides with the expectation with respect to the joint probability distribution of \(\xi ^1, \ldots , \xi ^{k-1}\). In the rest of the paper, statements that contain random variables but no expectation are understood to hold almost surely, even when this is not stated explicitly.

Assumption 1

For a random variable \(\xi\) on \(\Omega\), let the function \(f(\cdot , \xi ) :H \times \Omega \rightarrow (-\infty , \infty ]\) be given such that \(\omega \mapsto f(v, \xi (\omega ))\) is measurable for every \(v \in H\) and such that \(f(\cdot , \xi )\) is convex, lower semi-continuous and proper almost surely. Additionally, \(f(\cdot , \xi )\) fulfills the following conditions:

  • The expectation \(\mathbf {E}_{\xi }\left[f(\cdot , \xi )\right] =: F(\cdot )\) is lower semi-continuous and proper.

  • The function \(f(\cdot , \xi )\) is Gâteaux differentiable almost surely on a non-empty common domain \(\mathcal {D}\left( \nabla f \right) \subseteq H\), i.e. for all \(v,w \in \mathcal {D}\left( \nabla f \right)\) the identity \(\langle \nabla f (v, \xi ) , w\rangle _{} = \lim _{h \rightarrow 0} \frac{f(v + hw, \xi ) - f(v, \xi )}{h}\) is fulfilled almost surely.

  • There exists \(m \in \mathbb {N}\) such that \(\left(\mathbf{E}_{\xi }\left[ \Vert \nabla f(w^*,\xi )\Vert _{H^*}^{2^m} \right]\right)^{2^{-m}} =: \sigma < \infty\).

  • For every \(R > 0\) there exists \(L_{\xi }(R) :\Omega \rightarrow \mathbb {R}\) such that

    $$\begin{aligned} \Vert \nabla f(u, \xi ) - \nabla f(v, \xi ) \Vert _{H^*} \le L_{\xi }(R) \Vert u - v\Vert \end{aligned}$$

    almost surely for all \(u, v \in \mathcal {D}\left( \nabla f \right)\) with \(\Vert u \Vert , \Vert v\Vert \le R\). Furthermore, there exists a polynomial \(P :\mathbb {R}\rightarrow \mathbb {R}\) of degree \(2^m -2\) such that \(L_{\xi }(R) \le P(R)\) almost surely.

  • There exist a random variable \(M_{\xi } :\Omega \rightarrow \mathcal {L}(H)\), taking values in the symmetric operators, and a random variable \(\mu _{\xi } :\Omega \rightarrow [0, \infty )\) such that \(\mathbf {E}_{\xi }[\mu _{\xi }] = \mu > 0\) and \(\mathbf {E}_{\xi }[\mu _{\xi }^2] = \nu ^2 < \infty\). Moreover,

    $$\begin{aligned} \langle \nabla f(u, \xi ) - \nabla f(v, \xi ) , u - v\rangle _{} \ge ( M_{\xi }(u - v) , u - v )_{} \ge \mu _{\xi } \Vert u - v\Vert ^2 \end{aligned}$$

    is fulfilled almost surely for all \(u,v \in \mathcal {D}\left( \nabla f \right)\).
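As a simple illustration of these conditions (our own example, not part of the paper), consider a quadratic family with a bounded random curvature \(\mu _{\xi } \in [0, L]\) satisfying \(\mathbf {E}_{\xi }[\mu _{\xi }] > 0\) and a square-integrable random center \(a_{\xi }\):

$$\begin{aligned} f(w, \xi ) = \frac{\mu _{\xi }}{2} \Vert w - a_{\xi } \Vert ^2, \qquad \nabla f(w, \xi ) = \mu _{\xi } \, \iota ^{-1}(w - a_{\xi }). \end{aligned}$$

Then the Lipschitz condition holds globally with \(L_{\xi }(R) = \mu _{\xi } \le L\), so that \(m = 1\) and \(P \equiv L\); the monotonicity condition holds with \(M_{\xi } = \mu _{\xi } I\), since \(\langle \nabla f(u,\xi ) - \nabla f(v,\xi ) , u - v\rangle _{} = \mu _{\xi } \Vert u - v\Vert ^2\); and \(\sigma ^2 = \mathbf {E}_{\xi }\left[ \mu _{\xi }^2 \Vert w^* - a_{\xi }\Vert ^2\right] < \infty\). Individual realizations with \(\mu _{\xi } = 0\) are merely convex, but the family is strongly convex in expectation.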

An immediate consequence of Assumption 1 is that the gradient \(\nabla f(\cdot , \xi )\) is maximal monotone almost surely, see [27, Theorem A]. Hence, the resolvent (proximal operator)

$$\begin{aligned} T_{f, \xi } = (I + \iota \nabla f(\cdot ,\xi ))^{-1} \end{aligned}$$

is well-defined almost surely, see Lemma 1 for more details. Further, each resolvent maps into \(\mathcal {D}\left( \nabla f \right)\), and as a consequence every iterate \(w^k\) lies in \(\mathcal {D}\left( \nabla f \right)\). Finally, we may interchange expectation and differentiation so that \(\nabla F(w) = \mathbf {E}_{\xi }[\nabla f(w, \xi )]\). Note that this means that the approximation \(\nabla f(\cdot , \xi )\) is an unbiased estimate of the full gradient \(\nabla F\). In our case, this property can be shown via a straightforward argument based on dominated convergence similar to [32, Lemma 6], but we note that it also holds in more general settings [21, 29].

Remark 1

The idea behind the operators \(M_{\xi }\) is that each \(f(\cdot , \xi )\) is allowed to be only convex rather than strongly convex. However, the functions should be strongly convex for sufficiently many realizations, such that \(f(\cdot , \xi )\) is strongly convex in expectation. By assumption, F is lower semi-continuous, proper and strongly convex, so there is a minimizer \(w^*\) of (1) (cf. [4, Proposition 1.4]) which is unique due to the strong convexity.

Remark 2

Note that the local Lipschitz condition of Assumption 1 is a generalization compared to [32] and other existing literature. Instead of asking for one Lipschitz constant \(L_{\xi }\) that is valid on the entire domain, we only ask for a Lipschitz constant \(L_{\xi }(R)\) that depends on the norm of the input elements \(u, v \in \mathcal {D}(\nabla f)\). This means in particular that \(L_{\xi }(R)\) may tend to infinity as \(R \rightarrow \infty\). In the coming analysis we handle this by applying an a priori bound (Lemma 2) which shows that the iterates are bounded, so that R stays bounded too.

While the properness of F needs to be verified by application-specific means, the lower semi-continuity can be guaranteed on a more general level in different ways. If, e.g., it is additionally known that \(\mathbf {E}_{\xi } \left[\inf _{u \in H} f(u, \xi ) \right] > -\infty\) then one can employ Fatou’s lemma ([22, Theorem 2.3.6]) as in [32, Lemma 5], or slightly modify [5, Corollary 9.4].

We note that from a functional-analytic point of view, we are dealing with bounded rather than unbounded operators \(\nabla F\). However, operators that are traditionally seen as unbounded also fit into the framework, provided that the space H is chosen properly. For example, the functional \(F(w) = \frac{1}{2}\int {\Vert \nabla w\Vert ^2}\) corresponding to \(\nabla F = - \Delta\), the negative Laplacian, is unbounded on \(H = L^2\). But if we instead choose \(H = H^1_0\), then \(H^* = H^{-1}\) and \(\nabla F\) is bounded and Lipschitz continuous. In this case, the splitting of F(w) into \(f(w, \xi ^k)\) is less obvious than in our main application, but e.g. (randomized) domain decomposition as in  [25] is a natural idea. In each step, an elliptic problem then has to be solved (to apply \(\iota\)), but this can often be done very efficiently.
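As a rough sketch of this last point (our own illustration; the one-dimensional discretization and all names are placeholders), applying \(\iota\) when \(H = H^1_0(0,1)\) amounts to a single solve with the stiffness matrix, which here is tridiagonal:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

# Finite-difference model of H = H_0^1(0,1): the Riesz map iota sends a dual
# element (represented by its load vector g) to the solution of -u'' = g with
# u(0) = u(1) = 0, i.e. one solve with the tridiagonal stiffness matrix.
N = 99
h = 1.0 / (N + 1)
A = (diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(N, N)) / h**2).tocsc()

def apply_iota(g):
    return spsolve(A, g)        # one tridiagonal solve per application

g = np.ones(N)                  # placeholder dual element
u = apply_iota(g)               # its Riesz representative on the grid
print(u.max())                  # ~0.125: the exact solution is x * (1 - x) / 2
```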

Our main theorem states that we have sub-linear convergence of the iterates \(w^k\) to \(w^*\) in expectation:

Theorem 1

Let Assumption 1 be fulfilled and let \(\{\xi ^k\}_{k \in \mathbb {N}}\) be a family of jointly independent random variables on \(\Omega\). Then the scheme (3) converges sub-linearly if the step sizes fulfill \(\alpha _k = \frac{\eta }{k}\) with \(\eta > \frac{1}{\mu }\). In particular, the error bound

$$\begin{aligned} \mathbf {E}_{k-1}\left[ \Vert w^k - w^* \Vert ^2 \right] \le \frac{C}{k} \end{aligned}$$

is fulfilled, where C depends on \(\Vert w_1 - w^* \Vert\), \(\mu\), \(\nu\), \(\sigma\), \(\eta\) and m.

When \(m=1\), there is an L such that \(L_{\xi }(R) \le L\) almost surely for all R, and we have the explicit bound

$$\begin{aligned} C = \left( \Vert w^1 - w^* \Vert ^2 + \frac{2^{\mu \eta }\eta ^2 }{\mu \eta -1}\left(\sigma ^2 + 2L\sigma \left(\Vert w^1 - w^* \Vert ^2 + \sigma ^2 \sum _{j=1}^{k-1}{\alpha _j^2}\right)^{\frac{1}{2}} \right) \right)\mathrm {exp}\left(\frac{\nu ^2\eta ^2 \pi ^2}{4}\right) . \end{aligned}$$

For details on the error constant when \(m > 1\), we refer the reader to the proof, which is given in Sect. 4. We note that there is no upper bound on the step size \(\alpha _k\), as would be the case for an explicit method like SGD. There is still a lower bound, but this is not as critical. Similarly to the finite-dimensional case (see e.g. [32, Theorem 15]), the method still converges if the assumption \(\eta > \frac{1}{\mu }\) is not fulfilled, albeit at a slower rate \(\mathcal {O}(1/k^\gamma )\) with \(\gamma < 1\). This follows from a straightforward extension of Lemma 10 and the above theorem, but we omit these details for brevity. Moreover, we note that the exponential terms in the error constant are an artifact of the proof. They are not observed in practice and could likely be removed by the use of more refined algebraic inequalities.

The main idea of the proof is to acquire a contraction property of the form

$$\begin{aligned} \mathbf {E}_{k-1}\left[ \Vert w^k - w^* \Vert ^2 \right] \le C_k \mathbf {E}_{k-2}\left[ \Vert w^{k-1} - w^* \Vert ^2 \right] + \alpha _k^2 D, \end{aligned}$$

where \(C_k < 1\) and D are certain constants depending on the data. Inevitably, \(C_k \rightarrow 1\) as \(k \rightarrow \infty\), but because of the chosen step size sequence this happens slowly enough to still guarantee the optimal rate. To reach this point, we first show two things: first, an a priori bound of the form \(\mathbf {E}_{k-1}\left[ \Vert w^k - w^* \Vert ^2 \right] \le C\), i.e. unlike SGD, the SPI is always stable regardless of how large the step size is; second, that the resolvents \(T_{f, \xi }\) are contractive with

$$\begin{aligned} \mathbf {E}_{\xi } \left[\Vert T_{f, \xi } u - T_{f, \xi } v \Vert ^2\right] \le C_k\Vert u-v\Vert ^2. \end{aligned}$$

Similarly to [32], we do the latter by approximating the functions \(f(\cdot , \xi )\) by convex quadratic functions \(\tilde{f}(\cdot , \xi )\) for which the property is easier to verify, and then establishing a relation between the approximated and the true contraction factors. The series of lemmas in the next section is devoted to this preparatory work.
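As a quick numerical illustration of this recursion (our own sanity check with arbitrarily chosen constants; it is not part of the analysis), iterating the bound with the approximate contraction factor from Lemma 9 and \(\alpha _k = \nicefrac {\eta }{k}\) indeed produces an error of size \(\mathcal {O}(\nicefrac {1}{k})\):

```python
mu, nu, D, eta = 1.0, 1.5, 1.0, 2.0     # arbitrary constants with eta > 1/mu
a = 1.0                                 # stands in for E[ ||w^1 - w*||^2 ]
for k in range(1, 100001):
    alpha = eta / k
    C_k = 1.0 - 2.0 * mu * alpha + 3.0 * nu**2 * alpha**2   # approximate contraction factor
    a = C_k * a + alpha**2 * D
    if k in (1000, 10000, 100000):
        print(k, k * a)                 # k * a approaches a constant, so a = O(1/k)
```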

3 Preliminaries

First, let us show that the scheme is in fact well-defined, in the sense that every iterate is measurable if the random variables \(\{\xi ^k\}_{k \in \mathbb {N}}\) are.

Lemma 1

Let Assumption 1 be fulfilled. Further, let \(\{\xi ^k\}_{k \in \mathbb {N}}\) be a family of jointly independent random variables. Then for every \(k \in \mathbb {N}\) there exists a unique mapping \(w^{k+1} :\Omega \rightarrow \mathcal {D}\left( \nabla f \right)\) that fulfills (3) and is measurable with respect to the \(\sigma\)-algebra generated by \(\xi ^1, \ldots , \xi ^k\).

Proof

We define the mapping

$$\begin{aligned} h :\mathcal {D}\left( \nabla f \right) \times \Omega \rightarrow H, \quad (u, \omega ) \mapsto w^k - (I + \alpha _k \iota \nabla f(\cdot , \xi ^k(\omega ))) u. \end{aligned}$$

For almost all \(\omega \in \Omega\), the mapping \(f(\cdot , \xi ^k(\omega ))\) is lower semi-continuous, proper and convex. Thus, by [27, Theorem A] \(\nabla f(\cdot , \xi ^k(\omega ))\) is maximal monotone. By [4, Theorem 2.2], this shows that the operator \(\iota ^{-1} + \alpha _k \nabla f(\cdot , \xi ^k(\omega )) :\mathcal {D}\left( \nabla f \right) \rightarrow H^*\) is surjective. Note that the two previously cited results are stated for multi-valued operators. As we are in a more regular setting, the sub-differential of \(f(\cdot , \xi ^k(\omega ))\) only consists of a single element at each point. Therefore, it is possible to apply these multi-valued results also in our setting and interpret the appearing operators as single-valued. Furthermore, due to the monotonicity of \(\nabla f(\cdot , \xi ^k(\omega ))\) it follows that for \(u,v \in \mathcal {D}\left( \nabla f \right)\)

$$\begin{aligned} \langle \left(\iota ^{-1} + \alpha _k \nabla f(\cdot , \xi ^k(\omega ))\right) u - \left(\iota ^{-1} + \alpha _k \nabla f(\cdot , \xi ^k(\omega ))\right) v , u -v \rangle _{} \ge \Vert u-v\Vert ^2 \end{aligned}$$

which implies

$$\begin{aligned} \left\Vert \left(\iota ^{-1} + \alpha _k \nabla f(\cdot , \xi ^k(\omega ))\right) u - \left(\iota ^{-1} + \alpha _k \nabla f(\cdot , \xi ^k(\omega ))\right) v \right\Vert \ge \Vert u-v\Vert . \end{aligned}$$

This verifies that \(I + \alpha _k \iota \nabla f(\cdot , \xi ^k(\omega ))\) is injective. As we have proved that the operator is both injective and surjective, it is, in particular, bijective. Therefore, there exists a unique element \(w^{k+1}(\omega )\) such that

$$\begin{aligned} h(w^{k+1}(\omega ), \omega ) = w^k - (I + \alpha _k \iota \nabla f(\cdot , \xi ^k(\omega ))) w^{k+1}(\omega ) = 0. \end{aligned}$$

We can now apply [14, Lemma 2.1.4] or [15, Lemma  4.3] and obtain that \(\omega \mapsto w^{k+1}(\omega )\) is measurable. \(\square\)

Proving that the scheme is always stable is relatively straightforward, as shown in the next lemma. With some extra effort, we also get stability in stronger norms, i.e. we can bound not only \(\mathbf {E}_{k}\left[\Vert w^{k+1} - w^*\Vert ^2\right]\) but also higher moments \(\mathbf {E}_{k}\left[\Vert w^{k+1} - w^*\Vert ^{2^m}\right]\), \(m \in \mathbb {N}\). This will be important since we only have the weaker local Lipschitz continuity stated in Assumption 1 rather than global Lipschitz continuity. The idea of the proof stems from a standard technique mostly applied in the field of evolution equations in a variational framework, compare for example [31, Lemma 8.6]. The main difficulty is to incorporate the stochastic gradient in the presentation.

Lemma 2

Let Assumption 1 be fulfilled, and suppose that \(\sum _{k=1}^{\infty }{\alpha _k^2} < \infty\). Then there exists a constant \(D \ge 0\) depending only on \(\Vert w_1 - w^*\Vert\), \(\sum _{k=1}^{\infty }{\alpha _k^2}\) and \(\sigma\), such that

$$\begin{aligned} \mathbf {E}_{k}\left[\Vert w^{k+1} - w^*\Vert ^{2^m}\right] \le D \end{aligned}$$

for all \(k \in \mathbb {N}\).

Proof

Within the proof, we abbreviate the function \(f(\cdot , \xi ^k)\) by \(f_k\), \(k \in \mathbb {N}\). First, we consider the case \(m = 1\). Recall the identity \(( a - b , a )_{} = \frac{1}{2} \left(\Vert a\Vert ^2 - \Vert b\Vert ^2 + \Vert a-b\Vert ^2\right)\), \(a,b \in H\). We write the scheme as

$$\begin{aligned} w^{k+1} - w^k + \alpha _k \iota \nabla f_k(w^{k+1}) = 0, \end{aligned}$$

subtract \(\alpha _k \iota \nabla f_k(w^{*})\) from both sides, multiply by two and test it with \(w^{k+1} - w^*\) to obtain

$$\begin{aligned}&\Vert w^{k+1} - w^*\Vert ^2 - \Vert w^k - w^*\Vert ^2 + \Vert w^{k+1} -w^k\Vert ^2 \\&\qquad + 2 \alpha _k ( \iota \nabla f_k(w^{k+1}) - \iota \nabla f_k(w^*) , w^{k+1} - w^* )_{} \\&\quad = - 2 \alpha _k ( \iota \nabla f_k(w^*) , w^{k+1} - w^* )_{}. \end{aligned}$$

For the right-hand side, we have by Young’s inequality that

$$\begin{aligned}&- 2 \alpha _k ( \iota \nabla f_k(w^*) , w^{k+1} - w^* )_{} \\&\quad = - 2 \alpha _k \langle \nabla f_k(w^*) , w^{k+1} - w^k\rangle _{} - 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\\&\quad \le 2 \alpha _k \Vert \nabla f_k(w^*)\Vert _{H^*} \Vert w^{k+1} - w^k\Vert - 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\\&\quad \le \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2 + \Vert w^{k+1} - w^k\Vert ^2 - 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}. \end{aligned}$$

Together with the monotonicity condition, it then follows that

$$\begin{aligned} \begin{aligned} \Vert w^{k+1} - w^*\Vert ^2 - \Vert w^k - w^*\Vert ^2&\le \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2- 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}. \end{aligned} \end{aligned}$$
(4)

Since \(w^k - w^*\) is independent of \(\xi ^k\) and \(\mathbf {E}_{\xi ^k}[\nabla f_k(w^*)] = \nabla F(w^*) = 0\), taking the expectation \(\mathbf {E}_{\xi ^k}\) thus leads to the following bound:

$$\begin{aligned} \mathbf {E}_{\xi ^k}\left[\Vert w^{k+1} - w^*\Vert ^2\right] \le \Vert w^k - w^*\Vert ^2 + \alpha _k^2 \sigma ^2. \end{aligned}$$

Repeating this argument, we obtain that

$$\begin{aligned} \mathbf {E}_{k}\left[\Vert w^{k+1} - w^*\Vert ^2\right] \le \Vert w_1 - w^*\Vert ^2 + \sigma ^2 \sum _{j = 1}^{k}\alpha _{j}^2. \end{aligned}$$
(5)

In order to find the higher moment bound, we recall (4). We then follow a similar idea as in [10, Lemma 3.1], where we multiply this inequality with \(\Vert w^{k+1} - w^*\Vert ^2\) and use the identity \((a - b)a = \frac{1}{2} \left(|a |^2 - |b |^2 + |a-b |^2\right)\) for \(a,b \in \mathbb {R}\). It then follows that

$$\begin{aligned}&\Vert w^{k+1} - w^*\Vert ^4 - \Vert w^k - w^*\Vert ^4 + \left|\Vert w^{k+1} - w^*\Vert ^2 - \Vert w^k - w^*\Vert ^2 \right|^2\\&\quad \le \left( \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2- 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\right) \Vert w^{k+1} - w^*\Vert ^2\\&\quad \le \left( \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2- 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \right) \\&\qquad \times \left(\Vert w^k - w^*\Vert ^2 + \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2- 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\right)\\&\quad \le \alpha _k^2 \Vert w^k - w^*\Vert ^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2- 2 \alpha _k \Vert w^k - w^*\Vert ^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \\&\qquad + \alpha _k^4 \Vert \nabla f_k(w^*)\Vert _{H^*}^4- 4 \alpha _k^3 \Vert \nabla f_k(w^*)\Vert _{H^*}^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\\&\qquad + 4 \alpha _k^2 \left(\langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\right)^2. \end{aligned}$$

Applying Young’s inequality to the first and fourth term of the previous row then implies that

$$\begin{aligned}&\Vert w^{k+1} - w^*\Vert ^4 - \Vert w^k - w^*\Vert ^4\\&\quad \le \frac{\alpha _k^2}{2}\Vert w^k - w^*\Vert ^4 - 2 \alpha _k \Vert w^k - w^*\Vert ^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \\&\qquad + \left(3\alpha _k^4 + \frac{\alpha _k^2}{2}\right) \Vert \nabla f_k(w^*)\Vert _{H^*}^4 + 6 \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2 \Vert w^k - w^*\Vert ^2\\&\quad \le \frac{\alpha _k^2}{2}\Vert w^k - w^*\Vert ^4 - 2 \alpha _k \Vert w^k - w^*\Vert ^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \\&\quad \quad + \left(3\alpha _k^4 + \frac{\alpha _k^2}{2}\right) \Vert \nabla f_k(w^*)\Vert _{H^*}^4 + 3 \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^4 + 3 \alpha _k^2 \Vert w^k - w^*\Vert ^4\\&\quad \le \frac{7\alpha _k^2}{2}\Vert w^k - w^*\Vert ^4 - 2 \alpha _k \Vert w^k - w^*\Vert ^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \\&\quad \quad + \left(3\alpha _k^4 + \frac{7\alpha _k^2}{2}\right) \Vert \nabla f_k(w^*)\Vert _{H^*}^4. \end{aligned}$$

Summing up from \(j=1\) to k and taking the expectation \(\mathbf {E}_{k}\) yields

$$\begin{aligned}&\mathbf {E}_{k} \left[\Vert w^{k+1} - w^*\Vert ^4\right] \\&\quad \le \Vert w_1 - w^*\Vert ^4 + \sum _{j = 1}^{k} \frac{7\alpha _j^2}{2} \mathbf {E}_{j-1} \left[\Vert w^j - w^*\Vert ^4 \right] + \sigma ^4 \sum _{j=1}^{k} \left(3\alpha _j^4 + \frac{7\alpha _j^2}{2}\right). \end{aligned}$$

We then apply the discrete Grönwall inequality for sums (see, e.g., [11]) which shows that

$$\begin{aligned} \mathbf {E}_k \left[\Vert w^{k+1} - w^*\Vert ^4\right] \le \left(\Vert w_1 - w^*\Vert ^4 + \sigma ^4 \sum _{j=1}^{k} \left(3\alpha _j^4 + \frac{7\alpha _j^2}{2}\right)\right) \mathrm {exp}\left(\frac{7}{2}\sum _{j = 1}^{k} \alpha _j^2\right). \end{aligned}$$

For the next higher bound \(\mathbf {E}_k \left[\Vert w^{k+1} - w^*\Vert ^8\right]\), we recall that

$$\begin{aligned}&\Vert w^{k+1} - w^*\Vert ^4 - \Vert w^k - w^*\Vert ^4\\&\quad \le \frac{7\alpha _k^2}{2}\Vert w^k - w^*\Vert ^4 - 2 \alpha _k \Vert w^k - w^*\Vert ^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \\&\quad \quad + \left(3\alpha _k^4 + \frac{7\alpha _k^2}{2}\right) \Vert \nabla f_k(w^*)\Vert _{H^*}^4, \end{aligned}$$

which we can multiply with \(\Vert w^{k+1} - w^*\Vert ^4\) in order to follow the same strategy as before. Following this approach, we find bounds for \(\mathbf {E}_k \left[\Vert w^{k+1} - w^*\Vert ^{2^m}\right]\) recursively for all \(m \in \mathbb {N}\). \(\square\)

Remark 3

In particular, Lemma 2 implies that there exists a constant D depending on \(\Vert w_1 - w^*\Vert\), \(\sum _{k=1}^{\infty }{\alpha _k^2}\) and \(\sigma\) such that

$$\begin{aligned} \mathbf {E}_{k}\left[\Vert w^{k+1} - w^*\Vert ^{p}\right] \le D \end{aligned}$$

for all \(p \le 2^m\) and \(k \in \mathbb {N}\). Further, comparing (5)

$$\begin{aligned} \mathbf {E}_k{\left[ \Vert w^{k+1} - w^* \Vert _{}^2\right] } \le \Vert w_1 - w^* \Vert _{}^2 + \sum _{i=1}^k \alpha _i^2 \mathbf {E}_{\xi ^i}{\left[ \Vert \nabla f(w^*, \xi ^i) \Vert _{}^2 \right] }, \end{aligned}$$

to the corresponding bound for the SGD

$$\begin{aligned} \mathbf {E}_k{\left[ \Vert w^{k+1} - w^* \Vert _{}^2\right] } \le \Vert w_1 - w^* \Vert _{}^2 + \sum _{i=1}^k \alpha _i^2 \mathbf {E}_{i}{\left[ \Vert \nabla f(w^i, \xi ^i) \Vert _{}^2 \right] }, \end{aligned}$$

indicates that the SPI has a smaller a priori bound than the SGD. This bound plays a crucial part in the error constant in the convergence proof of Theorem 1. In practice one would expect the terms \(\mathbf {E}_{\xi ^i}{\left[\Vert \nabla f(w^*,\xi ^i) \Vert _{}^2 \right]}\) to be significantly smaller than \(\mathbf {E}_{i}{\left[\Vert \nabla f(w^i,\xi ^i) \Vert _{}^2 \right]}\) if the variance of \(\nabla f(\cdot , \xi ^i)\) is small. Note that since we assume that we have an unbiased estimate and \(\nabla F(w^*) = 0\), the variance at the minimizer is given by \(\mathbf {E}_{\xi ^i}\left[{\Vert \nabla f(w^*,\xi ^i) \Vert _{}}^2\right] -\Vert \mathbf {E}_{\xi ^i}\left[{\nabla f(w^*, \xi ^i)}\right] \Vert _{}^2 = \mathbf {E}_{\xi ^i}\left[{\Vert \nabla f(w^*,\xi ^i) \Vert _{}}^2\right]\).

Following Ryu and Boyd [32], we now introduce the function \(\tilde{f}(\cdot ,\xi ) :H \times \Omega \rightarrow (-\infty ,\infty ]\) given by

$$\begin{aligned} \tilde{f}(u,\xi )=f(u_0,\xi ) + \langle \nabla f(u_0,\xi ) , u - u_0\rangle _{}+ \frac{1}{2} ( M_{\xi }(u - u_0) , u - u_0 )_{}, \end{aligned}$$
(6)

where \(u_0 \in \mathcal {D}\left( \nabla f \right)\) is a fixed parameter. This mapping is a convex approximation of f. Furthermore, we define the function \(\tilde{r}(\cdot ,\xi ) :H \times \Omega \rightarrow (-\infty ,\infty ]\) given by

$$\begin{aligned} \tilde{r}(u,\xi ) = f(u,\xi ) - \tilde{f}(u,\xi ). \end{aligned}$$
(7)

Their gradients \(\nabla \tilde{f}(\cdot ,\xi ) :H \times \Omega \rightarrow H^*\) and \(\nabla \tilde{r}(\cdot ,\xi ) :\mathcal {D}\left( \nabla f \right) \times \Omega \rightarrow H^*\) can be stated as

$$\begin{aligned} \nabla \tilde{f}(u,\xi )&= \nabla f(u_0,\xi ) + ( M_{\xi }(u - u_0) , \cdot )_{}, \quad u \in H,\\ \nabla \tilde{r}(u,\xi )&= \nabla f(u,\xi ) - \nabla f(u_0,\xi ) - ( M_{\xi }(u - u_0) , \cdot )_{}, \quad u \in \mathcal {D}\left( \nabla f \right) \end{aligned}$$

almost surely. In the following lemma, we collect some standard properties of these operators.

Lemma 3

The function \(\tilde{r}(\cdot , \xi )\) defined in (7) is convex almost surely, i.e., it fulfills \(\tilde{r}(u,\xi ) \ge \tilde{r}(v,\xi ) + \langle \nabla \tilde{r}(v,\xi ) , u - v\rangle _{}\) for all \(u,v \in \mathcal {D}\left( \nabla f \right)\) almost surely. As a consequence, the gradient \(\nabla \tilde{r}(\cdot ,\xi )\) is monotone almost surely.

Proof

In the following proof, let us omit \(\xi\) for simplicity and let \(u,v \in \mathcal {D}\left( \nabla f \right)\) be given. Integrating the monotonicity property of \(\nabla f\) stated in Assumption 1 along the segment from v to u, it follows that

$$\begin{aligned} f(u) \ge f(v) + \langle \nabla f(v) , u - v\rangle _{} + \frac{1}{2}( M(u - v) , u - v )_{}. \end{aligned}$$

For the function \(\tilde{f}\) we can write

$$\begin{aligned} \tilde{f}(u)&= f(u_0) + \langle \nabla f(u_0) , u - u_0\rangle _{}+ \frac{1}{2} ( M(u - u_0) , u - u_0 )_{},\\ \nabla \tilde{f}(u)&= \nabla f(u_0) + ( M(u -u_0) , \cdot )_{} \quad \text {and} \quad \nabla ^2 \tilde{f}(u) = M. \end{aligned}$$

All further derivatives are zero. Thus, we can use a Taylor expansion around v to write

$$\begin{aligned} \tilde{f}(u)&= \tilde{f}(v) + \langle \nabla \tilde{f}(v) , u - v\rangle _{} + \frac{1}{2}( M(u - v) , u - v )_{}. \end{aligned}$$

It then follows that

$$\begin{aligned} \tilde{r}(u)&\ge f(v) + \langle \nabla f(v) , u - v\rangle _{} + \frac{1}{2}( M(u - v) , u - v )_{}\\&\quad - \left(\tilde{f}(v) + \langle \nabla \tilde{f}(v) , u - v\rangle _{} + \frac{1}{2}( M(u - v) , u - v )_{}\right)\\&= \tilde{r}(v) + \langle \nabla \tilde{r}(v) , u - v\rangle _{}. \end{aligned}$$

By [41, Proposition 25.10], it follows that \(\nabla \tilde{r}\) is monotone. \(\square\)

The following lemma demonstrates that the resolvents \(T_{\tilde{f}, \xi }\) and certain perturbations of them are well-defined. Furthermore, we will provide a more explicit formula for such resolvents. A comparable result is mentioned in [32, page 10]; we include a proof for the sake of completeness.

Lemma 4

Let Assumption 1 be fulfilled and let \(\tilde{f}(\cdot , \xi )\) be defined as in (6). Then the operator

$$\begin{aligned} T_{\tilde{f}, \xi } = (I + \iota \nabla \tilde{f}(\cdot ,\xi ))^{-1} :H \times \Omega \rightarrow H \end{aligned}$$

is well-defined. If a function \(r(\cdot , \xi ) :H \times \Omega \rightarrow (-\infty , \infty ]\) is Gâteaux differentiable with the common domain \(\mathcal {D}\left( \nabla r \right) = \mathcal {D}\left( \nabla f \right)\), lower semi-continuous, convex and proper almost surely, then

$$\begin{aligned} T_{\tilde{f}+ r, \xi } = (I + \iota \nabla \tilde{f}(\cdot ,\xi ) + \iota \nabla r(\cdot ,\xi ))^{-1} :H \times \Omega \rightarrow \mathcal {D}\left( \nabla f \right) \end{aligned}$$

is well-defined.

If there exist \(Q_{\xi } :\mathcal {D}\left( \nabla f \right) \times \Omega \rightarrow H^*\) and \(z_{\xi } :\Omega \rightarrow H^*\) such that \(\nabla r (u,\xi ) = Q_{\xi } u + z_{\xi }\) then the resolvent can be represented by

$$\begin{aligned} T_{\tilde{f} + r, \xi } u = (I + M_{\xi } + \iota Q_{\xi })^{-1} \left(u - \iota \nabla f(u_0,\xi ) + M_{\xi }u_0 - \iota z_{\xi }\right). \end{aligned}$$

Proof

For simplicity, let us omit \(\xi\) again. In order to prove that \(T_{\tilde{f}}\) and \(T_{\tilde{f} + r}\) are well-defined, we can apply [27, Theorem A] and [4, Theorem 2.2] analogously to the argumentation in the proof of Lemma 1.

Assuming that \(\nabla r (u) = Q u + z\), we find an explicit representation for \(T_{\tilde{f} + r}\). To this end, for \(v \in H\), consider

$$\begin{aligned} (I + \iota \nabla \tilde{f} + \iota \nabla r)^{-1} v = T_{\tilde{f} + r} v =: u \in \mathcal {D}\left( \nabla f \right) . \end{aligned}$$

Then it follows that

$$\begin{aligned} v = (I + \iota \nabla \tilde{f} + \iota \nabla r) u = (I + M + \iota Q) u + \iota \nabla f(u_0) - Mu_0 + \iota z. \end{aligned}$$

Rearranging the terms yields

$$\begin{aligned} T_{\tilde{f} + r} v&= (I + M + \iota Q)^{-1} \left(v - \iota \nabla f(u_0) + Mu_0 - \iota z\right). \end{aligned}$$

\(\square\)

Next, we will show that the contraction factors of \(T_{f, \xi }\) and \(T_{\tilde{f}, \xi }\) are related. For this, we need the following basic identities and some stronger inequalities that hold for symmetric positive operators on H. These results are fairly standard and similar statements can be found in [32, Lemma 9 and Lemma 10]. For the sake of completeness, we provide an alternative proof that is better adapted to our notation.

Lemma 5

Let Assumption 1 be satisfied and let \(\tilde{f}(\cdot , \xi )\) and \(\tilde{r}(\cdot , \xi )\) be given as in (6) and (7), respectively. Then the identities

$$\begin{aligned} \iota \nabla f (T_{f, \xi }, \xi ) = I - T_{f,\xi } \quad \text {and} \quad \iota \nabla \tilde{f} (T_{f,\xi },\xi ) + T_{f,\xi } - I = - \iota \nabla \tilde{r} (T_{f,\xi },\xi ) \end{aligned}$$

are fulfilled almost surely.

Proof

By the definition of \(T_{f,\xi }\), we have that

$$\begin{aligned} T_{f,\xi } + \iota \nabla f (T_{f,\xi },\xi ) = (I + \iota \nabla f(\cdot ,\xi ) ) T_{f,\xi } = I, \end{aligned}$$

from which the first claim follows immediately. The second identity then follows from

$$\begin{aligned} \iota \nabla \tilde{f} (T_{f,\xi },\xi ) + T_{f,\xi } - I = \iota \nabla \tilde{f} (T_{f,\xi },\xi ) - \iota \nabla f (T_{f,\xi },\xi ) = - \iota \nabla \tilde{r} (T_{f,\xi },\xi ) . \end{aligned}$$

\(\square\)

As a consequence of Lemma 5 we have the following basic inequalities:

Lemma 6

Let Assumption 1 be satisfied. It then follows that

$$\begin{aligned} \Vert T_{f,\xi } u - u\Vert \le \Vert \nabla f(u,\xi )\Vert _{H^*} \end{aligned}$$

almost surely for every \(u \in \mathcal {D}\left( \nabla f \right)\). Additionally, if for \(R >0\) the bound \(\Vert u\Vert + \Vert \nabla f(u,\xi )\Vert _{H^*} \le R\) holds true almost surely, then

$$\begin{aligned} \Vert \iota ^{-1}(T_{f,\xi } u - u) + \nabla f(u,\xi )\Vert _{H^*} \le L_{\xi }(R)\Vert \nabla f(u,\xi )\Vert _{H^*} \end{aligned}$$

is fulfilled almost surely.

Proof

In order to shorten the notation, we omit the \(\xi\) in the following proof and let u be in \(\mathcal {D}\left( \nabla f \right)\). For the first inequality, we note that since \(\nabla f\) is monotone, we have

$$\begin{aligned} \langle \nabla f (T_f u) - \nabla f(u) , T_f u - u\rangle _{} \ge 0. \end{aligned}$$

Thus, by the first identity in Lemma 5, it follows that

$$\begin{aligned} \langle -\nabla f(u) , T_f u - u\rangle _{}&= \langle \nabla f(T_fu) - \nabla f(u) , T_f u - u\rangle _{} - \langle \nabla f (T_f u ) , T_f u - u\rangle _{}\\&\ge \langle \iota ^{-1}(T_f u - u) , T_f u - u\rangle _{}\\&= ( T_f u - u , T_f u - u )_{} = \Vert T_f u - u \Vert ^2. \end{aligned}$$

But by the Cauchy-Schwarz inequality, we also have

$$\begin{aligned} \langle -\nabla f(u) , T_f u - u\rangle _{} \le \Vert \nabla f(u)\Vert _{H^*} \Vert T_f u - u \Vert , \end{aligned}$$

which in combination with the previous inequality proves the first claim.

The second inequality follows from the first part of this lemma. Because

$$\begin{aligned} \Vert T_f u\Vert \le \Vert T_f u - u\Vert + \Vert u\Vert \le \Vert \nabla f(u) \Vert _{H^*} + \Vert u\Vert , \end{aligned}$$

both u and \(T_f u\) are in a ball of radius R. Thus, we obtain

$$\begin{aligned} \Vert \iota ^{-1} (T_f u - u) + \nabla f(u)\Vert _{H^*}&= \Vert \nabla f(u) - \nabla f(T_f u)\Vert _{H^*} \\&\le L_{}(R) \Vert u - T_f u\Vert \le L_{}(R) \Vert \nabla f(u)\Vert _{H^*}. \end{aligned}$$

\(\square\)

Lemma 7

Let \(Q, S \in \mathcal {L}(H)\) be symmetric operators. Then the following holds:

  • If Q is invertible and S and \(Q^{-1}\) are strictly positive, then \((Q + S)^{-1} < Q^{-1}\). If S is only positive, then \((Q + S)^{-1} \le Q^{-1}\).

  • If Q is a positive and contractive operator, i.e. \(\Vert Qu\Vert \le \Vert u\Vert\) for all \(u \in H\), then it follows that \(\Vert Qu\Vert ^2 \le ( Qu , u )_{}\) for all \(u \in H\).

  • If Q is a strongly positive invertible operator, such that there exists \(\beta > 0\) with \(( Q u , u )_{} \ge \beta \Vert u\Vert ^2\) for all \(u \in H\), then \(\Vert Q u \Vert \ge \beta \Vert u\Vert\) for all \(u \in H\) and \(\Vert Q^{-1}\Vert _{\mathcal {L}(H)} \le \frac{1}{\beta }\).

Proof

We start by expressing \((Q + S)^{-1}\) in terms of \(Q^{-1}\) and S, similar to the Sherman-Morrison-Woodbury formula for matrices [18]. First observe that the operator \((I + Q^{-1}S)^{-1} \in \mathcal {L}(H)\) by e.g. [19, Lemma 2A.1]. Then, since

$$\begin{aligned}&\left(Q^{-1} - Q^{-1}S\left(I + Q^{-1}S \right)^{-1}Q^{-1}\right) (Q+S) \\&\quad = I + Q^{-1}S - Q^{-1}S\left(I + Q^{-1}S \right)^{-1}\left(I + Q^{-1}S \right) = I \end{aligned}$$

and

$$\begin{aligned}&(Q+S)\left(Q^{-1} - Q^{-1}S\left(I + Q^{-1}S \right)^{-1}Q^{-1}\right) \\&\quad = I + SQ^{-1} - S\left(I + Q^{-1}S \right)\left(I + Q^{-1}S \right)^{-1}Q^{-1} = I, \end{aligned}$$

we find that

$$\begin{aligned} (Q + S)^{-1} = Q^{-1} - Q^{-1}S\left(I + Q^{-1}S \right)^{-1}Q^{-1}. \end{aligned}$$

Since \(Q^{-1}\) is symmetric, we see that \((Q + S)^{-1} < Q^{-1}\) if and only if \(S\left(I + Q^{-1}S \right)^{-1}\) is strictly positive. But this is true, as we see from the change of variables \(z = (I + Q^{-1}S)^{-1} u\). Because then

$$\begin{aligned} \left(S\left(I + Q^{-1}S \right)^{-1} u,u\right)_{}&= \left(Sz,z + Q^{-1}Sz\right)_{} = \left(Sz,z\right)_{} + \left(Q^{-1}Sz,Sz\right)_{} > 0 \end{aligned}$$

for any \(u \in H\), \(u \ne 0\), since S and \(Q^{-1}\) are strictly positive. If S is only positive, it follows analogously that \(\left(S\left(I + Q^{-1}S \right)^{-1} u,u\right)_{} \ge 0\).

In order to prove the second statement, we use the fact that there exists a unique symmetric and positive square root \(Q^{\nicefrac {1}{2}} \in \mathcal {L}(H)\) such that \(Q = Q^{\nicefrac {1}{2}}Q^{\nicefrac {1}{2}}\). Since \(\Vert Q\Vert _{\mathcal {L}(H)} = \sup _{\Vert x\Vert \le 1} ( Q x , x )_{} = \sup _{\Vert x\Vert \le 1} ( Q^{\frac{1}{2}} x , Q^{\frac{1}{2}} x )_{} = \Vert Q^{\nicefrac {1}{2}}\Vert _{\mathcal {L}(H)}^2\), also \(Q^{\nicefrac {1}{2}}\) is contractive. Thus, it follows that

$$\begin{aligned} \Vert Q u\Vert ^2 = \Vert Q^{\nicefrac {1}{2}}Q^{\nicefrac {1}{2}}u\Vert ^2 \le \Vert Q^{\nicefrac {1}{2}}u\Vert ^2 = ( Q^{\nicefrac {1}{2}}u , Q^{\nicefrac {1}{2}}u )_{} = ( Qu , u )_{}. \end{aligned}$$

Now, we prove the third statement. First we notice that \(( Qu , u )_{} \ge \beta \Vert u\Vert ^2\) and \(( Qu , u )_{} \le \Vert Qu\Vert \Vert u\Vert\) imply that \(\Vert Qu\Vert \ge \beta \Vert u\Vert\) for all \(u \in H\). Substituting \(u = Q^{-1}v\) then shows \(\Vert v\Vert \ge \beta \Vert Q^{-1}v \Vert\), which proves the final claim. \(\square\)
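The following finite-dimensional sanity check (our own, with randomly generated matrices; it is of course no substitute for the operator-theoretic argument above) illustrates the three statements:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
B = rng.normal(size=(n, n)); Q = B @ B.T + np.eye(n)       # symmetric, strictly positive
C = rng.normal(size=(n, n)); S = C @ C.T                   # symmetric, positive

# First statement: Q^{-1} - (Q + S)^{-1} is positive (semi)definite.
gap = np.linalg.inv(Q) - np.linalg.inv(Q + S)
print(np.linalg.eigvalsh(gap).min() >= -1e-12)

# Second statement: for a positive contraction Q0, ||Q0 u||^2 <= (Q0 u, u).
Q0 = Q / np.linalg.norm(Q, 2)                               # rescaled so that ||Q0|| <= 1
u = rng.normal(size=n)
print(np.dot(Q0 @ u, Q0 @ u) <= np.dot(Q0 @ u, u) + 1e-12)

# Third statement: (Qu, u) >= beta ||u||^2 implies ||Q^{-1}|| <= 1/beta.
beta = np.linalg.eigvalsh(Q).min()
print(np.linalg.norm(np.linalg.inv(Q), 2) <= 1.0 / beta + 1e-12)
```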

The previous lemma now allows us to extend [32, Theorem 10], which we have reformulated and restructured to match our setting. It relates the contraction factors of the true and approximated operators.

Lemma 8

Let Assumption 1 be fulfilled and let \(\tilde{f}(\cdot , \xi )\) be given as in (6). Then

$$\begin{aligned} \mathbf {E}_{\xi } \left[\frac{\Vert T_{f, \xi } u - T_{f, \xi } v \Vert ^2}{\Vert u-v\Vert ^2} \right] \le \left( \mathbf {E}_{\xi } \left[\frac{\Vert T_{\tilde{f}, \xi } u - T_{\tilde{f}, \xi } v \Vert ^2}{\Vert u-v\Vert ^2} \right]\right)^{\nicefrac {1}{2}} \end{aligned}$$

holds for every \(u,v \in H\).

Proof

For better readability, we once again omit \(\xi\) where there is no risk of confusion. For \(u,v \in \mathcal {D}\left( \nabla f \right)\) with \(u \ne v\) and \(\varepsilon >0\), we approximate the function \(\tilde{r}(\cdot , \xi )\) defined in (7) by

$$\begin{aligned} \tilde{r}_{\varepsilon }(\cdot , \xi ) :H \times \Omega \rightarrow (-\infty ,\infty ], \quad \tilde{r}_{\varepsilon } (z,\xi ) = \langle \nabla \tilde{r}(T_f u,\xi ) , z\rangle _{} + \frac{\left(\langle v_{\varepsilon } , z - T_f u\rangle _{}\right)^2}{2 a_{\varepsilon }}, \end{aligned}$$

where

$$\begin{aligned} v_{\varepsilon } = - \nabla \tilde{r}(T_f u) + \nabla \tilde{r}(T_f v) + \varepsilon \iota ^{-1} (T_f v - T_f u) \in H^* \quad \text {and} \quad a_{\varepsilon } = \langle v_{\varepsilon } , T_f v - T_f u\rangle _{}. \end{aligned}$$

As we can write

$$\begin{aligned} a_{\varepsilon }&= \langle - \nabla \tilde{r}(T_f u) + \nabla \tilde{r}(T_f v) + \varepsilon \iota ^{-1} (T_f v - T_f u) , T_f v-T_f u\rangle _{}\\&= \langle \nabla \tilde{r}(T_f u) - \nabla \tilde{r}(T_f v) , T_f u-T_f v\rangle _{} + \varepsilon ( T_f v - T_f u , T_f v-T_f u )_{}\\&\ge \varepsilon \Vert T_f v - T_f u \Vert ^2 > 0, \end{aligned}$$

\(\tilde{r}_{\varepsilon }\) is well-defined. The derivative is given by \(\nabla \tilde{r}_{\varepsilon }(\cdot ,\xi ) :H \times \Omega \rightarrow H^*\),

$$\begin{aligned} \nabla \tilde{r}_{\varepsilon } (z) = \nabla \tilde{r}(T_f u) + \frac{\langle v_{\varepsilon } , z - T_f u\rangle _{}}{a_{\varepsilon }} v_{\varepsilon } = \frac{\langle v_{\varepsilon } , z\rangle _{}}{a_{\varepsilon }} v_{\varepsilon } + \nabla \tilde{r}(T_f u) - \frac{\langle v_{\varepsilon } , T_f u\rangle _{}}{a_{\varepsilon }} v_{\varepsilon }. \end{aligned}$$

This function \(\nabla \tilde{r}_{\varepsilon }\) is an interpolation between the points

$$\begin{aligned} \nabla \tilde{r}_{\varepsilon } (T_f u)&= \nabla \tilde{r}(T_f u) \quad \text {and}\\ \nabla \tilde{r}_{\varepsilon } (T_f v)&= \nabla \tilde{r}(T_f u) + \frac{\langle v_{\varepsilon } , T_f v - T_f u\rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\\&= \nabla \tilde{r}(T_f u) + \frac{\langle v_{\varepsilon } , T_f v - T_f u\rangle _{}}{\langle v_{\varepsilon } , T_f v - T_f u\rangle _{}} v_{\varepsilon }\\&= \nabla \tilde{r}(T_f u) - \nabla \tilde{r}(T_f u) + \nabla \tilde{r}(T_f v) + \varepsilon \iota ^{-1} (T_f v - T_f u)\\&= \nabla \tilde{r}(T_f v) + \varepsilon \iota ^{-1} (T_f v - T_f u). \end{aligned}$$

Furthermore, since \(T_{\tilde{f} + \tilde{r}_{\varepsilon }} = (I + \iota \nabla \tilde{f} + \iota \nabla \tilde{r}_{\varepsilon })^{-1}\), it follows that

$$\begin{aligned} (I + \iota \nabla \tilde{f} + \iota \nabla \tilde{r}_{\varepsilon } ) T_f u&= T_f u + \iota \nabla \tilde{f}(T_f u) + \iota \nabla \tilde{r} (T_f u)\\&= T_f u + \iota \nabla f(T_f u) = (I + \iota \nabla f) T_f u = u, \end{aligned}$$

and therefore

$$\begin{aligned} T_f u = (I + \iota \nabla \tilde{f} + \iota \nabla \tilde{r}_{\varepsilon } )^{-1} u = T_{\tilde{f} + \tilde{r}_{\varepsilon }} u. \end{aligned}$$

Applying Lemma 5, we find that

$$\begin{aligned}&(I + \iota \nabla \tilde{f} + \iota \nabla \tilde{r}_{\varepsilon } ) T_f v\\&\quad = T_f v + \iota \nabla \tilde{f}(T_f v) + \iota \nabla \tilde{r} (T_f v) + \varepsilon (T_f v - T_f u)\\&\quad = T_f v + \iota \nabla f(T_f v) + \varepsilon (T_f v - T_f u)= v + \varepsilon (T_f v - T_f u). \end{aligned}$$

This shows that

$$\begin{aligned} T_f v = (I + \iota \nabla \tilde{f} + \iota \nabla \tilde{r}_{\varepsilon } )^{-1}(v + \varepsilon (T_f v - T_f u)) = T_{\tilde{f} + \tilde{r}_{\varepsilon }} (v + \varepsilon (T_f v - T_f u)). \end{aligned}$$
(8)

Using the explicit representation of \(T_{\tilde{f} + \tilde{r}_{\varepsilon }}\) from Lemma 4, it follows that

$$\begin{aligned} T_{\tilde{f} + \tilde{r}_{\varepsilon }} z= \left(I + M + \iota \left(\frac{\langle v_{\varepsilon } , \cdot \rangle }{a_{\varepsilon }} v_{\varepsilon }\right) \right)^{-1} \left(z - \iota \nabla f(u_0) + Mu_0 - \iota \left(\nabla \tilde{r}(T_f u) - \frac{\langle v_{\varepsilon } , T_f u\rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\right) \right). \end{aligned}$$

Therefore, we have

$$\begin{aligned}&\Vert T_{\tilde{f} + \tilde{r}_{\varepsilon }} v - T_{\tilde{f} + \tilde{r}_{\varepsilon }} (v + \varepsilon (T_f v - T_f u))\Vert \\&\quad \le \left\Vert \left(I + M + \iota \left(\frac{\langle v_{\varepsilon } , \cdot \rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\right) \right)^{-1} \right\Vert _{\mathcal {L}(H)} \Vert v - v - \varepsilon (T_f v - T_f u) \Vert \\&\quad \le \varepsilon \Vert T_f v - T_f u \Vert \rightarrow 0 \quad \text { as } \varepsilon \rightarrow 0, \end{aligned}$$

since

$$\begin{aligned} \left(\left(I + M + \iota \left(\frac{\langle v_{\varepsilon } , \cdot \rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\right)\right) u,u\right)_{} \ge \Vert u\Vert ^2 \end{aligned}$$

means that we can apply Lemma 7. Thus, this shows that \(T_f u= T_{\tilde{f} + \tilde{r}_{\varepsilon }} u\) and \(T_f v = \lim _{\varepsilon \rightarrow 0} T_{\tilde{f} + \tilde{r}_{\varepsilon }} v\). Further, we can state an explicit representation for \(T_{\tilde{f}}\) using Lemma 4 given by

$$\begin{aligned} T_{\tilde{f}} z = (I + \iota \nabla \tilde{f} )^{-1} z = (I + M)^{-1} \left(z - \iota \nabla f(u_0) + Mu_0\right). \end{aligned}$$

For \(n = \frac{u-v}{\Vert u-v\Vert }\) with \(\Vert n\Vert = 1\), we obtain using Lemma 7

$$\begin{aligned} \frac{\Vert T_{\tilde{f}} u - T_{\tilde{f}} v \Vert }{\Vert u - v \Vert }&= \Vert (I + M)^{-1}n \Vert \\&\ge ( (I + M)^{-1} n , n )_{} \\&\ge \left(\left(I + M + \iota \left(\frac{\langle v_{\varepsilon } , \cdot \rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\right) \right)^{-1} n,n\right)_{} \\&\ge \left\Vert \left(I + M + \iota \left(\frac{\langle v_{\varepsilon } , \cdot \rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\right) \right)^{-1} n \right\Vert ^2 \\&= \frac{\Vert T_{\tilde{f} + \tilde{r}_{\varepsilon } } u - T_{\tilde{f} + \tilde{r}_{\varepsilon }} v \Vert ^2}{\Vert u - v \Vert ^2} \rightarrow \frac{\Vert T_f u - T_f v \Vert ^2}{\Vert u - v \Vert ^2} \quad \text { as } \varepsilon \rightarrow 0. \end{aligned}$$

Finally, as \(\mathbf {E}_{\xi } \left[ \frac{\Vert T_{\tilde{f}} u - T_{\tilde{f}} v \Vert }{\Vert u - v \Vert } \right]\) is finite, we can apply the dominated convergence theorem to obtain that

$$\begin{aligned} \mathbf {E}_{\xi } \left[ \frac{\Vert T_f u - T_f v \Vert ^2}{\Vert u - v \Vert ^2} \right] \le \mathbf {E}_{\xi } \left[ \frac{\Vert T_{\tilde{f}} u - T_{\tilde{f}} v \Vert }{\Vert u - v \Vert } \right] \le \left(\mathbf {E}_{\xi }\left[ \frac{\Vert T_{\tilde{f}} u - T_{\tilde{f}} v \Vert ^2}{\Vert u - v \Vert ^2} \right]\right)^{\frac{1}{2}}. \end{aligned}$$

\(\square\)

After having established a connection between the contraction properties of \(T_{f,\xi }\) and \(T_{\tilde{f},\xi }\), the next step is to provide a concrete result for the contraction factor of \(T_{\tilde{f},\xi }\). Applying Lemma 4, we can express this resolvent in terms of \(M_{\xi }\), which is easier to handle due to its linearity. The following lemma extends [32, Theorem 11]. As we are in an infinite-dimensional setting, we can no longer argue using the smallest eigenvalue of an operator. This proof instead uses the convexity parameters directly. Moreover, we provide an explicit, non-asymptotic bound for the contraction constant.

Lemma 9

Let Assumption 1 be satisfied and let \(\tilde{f}(\cdot ,\xi )\) be given as in (6). Then for \(u, v \in H\) and \(\alpha > 0\),

$$\begin{aligned} \mathbf {E}_{\xi }\left[ \Vert T_{\alpha \tilde{f}, \xi } u - T_{\alpha \tilde{f}, \xi } v \Vert ^2\right] < \mathbf {E}_{\xi }\left[ \Vert (I + \alpha M_{\xi })^{-1} \Vert _{\mathcal {L}(H)}^2 \right] \Vert u- v\Vert ^2 \end{aligned}$$

is fulfilled. Furthermore, it follows that

$$\begin{aligned} \mathbf {E}_{\xi }\left[ \Vert (I + \alpha M_{\xi })^{-1} \Vert _{\mathcal {L}(H)}^2 \right] < 1 - 2\mu \alpha + 3\nu ^2 \alpha ^2. \end{aligned}$$

Proof

Due to the explicit representation of \(T_{\alpha \tilde{f}, \xi }\) stated in Lemma 4, we find that

$$\begin{aligned} T_{\alpha \tilde{f}, \xi } u - T_{\alpha \tilde{f}, \xi } v = (I + \alpha M_{\xi })^{-1}(u-v) \end{aligned}$$

for \(u,v \in H\). As \(u-v\) does not depend on \(\Omega\), it follows that

$$\begin{aligned} \mathbf {E}_{\xi }\left[ \Vert (I + \alpha M_{\xi })^{-1} (u - v) \Vert ^2\right] \le \mathbf {E}_{\xi }\left[ \Vert (I + \alpha M_{\xi })^{-1} \Vert _{\mathcal {L}(H)}^2 \right] \Vert u - v \Vert ^2. \end{aligned}$$

Thus, we have reduced the problem to a question about “how contractive” the resolvent of \(M_{\xi }\) is in expectation. We note that for any \(u \in H\), we have

$$\begin{aligned} ( (I + \alpha M_{\xi })u , u )_{} \ge (1 + \mu _{\xi } \alpha ) \Vert u\Vert ^2. \end{aligned}$$

Due to Lemma 7 it follows that

$$\begin{aligned} \Vert (I + \alpha M_{\xi })^{-1}\Vert _{\mathcal {L}(H)}^2 \le (1 + \mu _{\xi } \alpha )^{-2} . \end{aligned}$$

The right-hand-side bound is a \(C^2(-\frac{1}{\mu _{\xi }},\infty )\)-function with respect to \(\alpha\), or even a \(C^2(\mathbb {R})\)-function if \(\mu _{\xi } = 0\), and its second derivative \(6\mu _{\xi }^2(1+\mu _{\xi }\alpha )^{-4}\) is bounded by \(6\mu _{\xi }^2\) for \(\alpha \ge 0\). By a second-order Taylor expansion around \(\alpha = 0\) we can therefore conclude that

$$\begin{aligned} \Vert (I + \alpha M_{\xi })^{-1}\Vert _{\mathcal {L}(H)}^2 \le 1 - 2 \mu _{\xi } \alpha + 3 \mu _{\xi }^2 \alpha ^2. \end{aligned}$$

Combining these results, we obtain

$$\begin{aligned} \mathbf {E}_{\xi } \left[\Vert (I + \alpha M_{\xi })^{-1} \Vert _{\mathcal {L}(H)}^2\right] \le \mathbf {E}_{\xi } \left[ 1 - 2 \mu _{\xi } \alpha + 3 \mu _{\xi }^2 \alpha ^2 \right] = 1 - 2\mu \alpha + 3\nu ^2 \alpha ^2. \end{aligned}$$

\(\square\)

Finally, the proof of the main theorem relies on iterating the step-wise bounds arising from the contraction properties of the resolvents which we just established. This leads to certain products of the contraction factors. The following algebraic inequalities show that these are bounded in the desired way. While this type of result has been stated previously for first-order polynomials in 1/j (see e.g. [24, Theorem 14]), we prove here a particular version for second-order polynomials that matches the approximation of the contraction factor stated in Lemma 9.

Lemma 10

Let \(C_1, C_2>0\), \(p>0\) and \(r \ge 0\) satisfy \(C_1p > r\) and \(4C_2 \ge C_1^2\). Then the following inequalities are satisfied:

  1. (i)

    \(\prod _{j= 1}^k \left(1-\frac{C_1}{j} + \frac{C_2}{j^2}\right)^{p} \le \mathrm {exp}\left(\frac{C_2 p \pi ^2}{6}\right) (k+1)^{-C_1p}\),

  2. (ii)

    \(\sum _{j=1}^k \frac{1}{ j^{1+r}} \prod _{i = j+1}^k \left(1-\frac{C_1}{i} + \frac{C_2}{i^2}\right)^{p} \le 2^{C_1p} \mathrm {exp}\left(\frac{C_2 p \pi ^2}{6}\right) \frac{1}{C_1 p-r} (k+1)^{-r}.\)

Proof

The proof relies on the trivial inequality \(1 + u \le \mathrm {e}^{u}\) for \(u \ge - 1\) and the following two basic inequalities involving (generalized) harmonic numbers

$$\begin{aligned} \ln {(k+1)} - \ln {(m)} \le \sum _{i=m}^k \frac{1}{i} \quad \text {and} \quad \sum _{i=1}^k{i^{C-1}} \le \frac{1}{C} (k+1)^C . \end{aligned}$$

The first one follows quickly by treating the sum as an upper Riemann sum for the integral \(\int _{m}^{k+1} {u^{-1} \,\mathrm {d}u}\). The second one can be proved analogously by treating the sum as a lower Riemann sum for the integral \(\int _0^{k+1} {u^{C-1}\,\mathrm {d}u}\), aligning the term \(i^{C-1}\) with the interval \([i-1,i]\) if \(C<1\) and with \([i,i+1]\) if \(C>1\).

The condition \(4C_2 \ge C_1^2\) implies that all the factors in the product (i) are positive. We therefore have that \(0\le 1-\frac{C_1}{j} + \frac{C_2}{j^2} \le \mathrm {exp}\big(-\frac{C_1}{j}\big) \mathrm {exp}\big(\frac{C_2}{j^2}\big)\). Thus, it follows that

$$\begin{aligned} \prod _{j= 1}^k \left(1-\frac{C_1}{j} + \frac{C_2}{j^2}\right)^{p}&\le \mathrm {exp}\left(-C_1 p \sum _{j=1}^k{\frac{1}{j}}\right) \mathrm {exp}\left( C_2 p \sum _{j=1}^k{\frac{1}{j^2}}\right) \\&\le \mathrm {exp}\left(-C_1 p \ln {(k+1)} \right) \mathrm {exp}\left(\frac{C_2 p\pi ^2}{6}\right), \end{aligned}$$

from which the first claim follows directly. For the second claim, we similarly have

$$\begin{aligned}&\sum _{j=1}^k \frac{1}{j^{1+r}} \prod _{i = j+1}^k \left(1-\frac{C_1}{i} + \frac{C_2}{i^2}\right)^{p} \le \mathrm {exp}\left(\frac{C_2 p \pi ^2}{6}\right)\sum _{j=1}^k \frac{1}{j^{1+r}} \mathrm {exp}\left(-C_1p \sum _{i=j+1}^k{\frac{1}{i}} \right) , \end{aligned}$$

where the latter sum can be bounded by

$$\begin{aligned} \sum _{j=1}^k \frac{1}{j^{1+r}} \mathrm {exp}\left(-C_1p \sum _{i=j+1}^k{\frac{1}{i}} \right)&\le \sum _{j=1}^k \frac{1}{j^{1+r}} \mathrm {exp}\left(-C_1p\ln \left( \frac{k+1}{j+1} \right) \right) \\&\le \sum _{j=1}^k{ \frac{1}{j^{1+r}} \left(\frac{k+1}{j+1}\right)^{-C_1p}} \\&= (k+1)^{-C_1p} \sum _{j=1}^k{ j^{C_1 p-r-1} \cdot \left(\frac{j+1}{j}\right)^{C_1p}} \\&\le \frac{2^{C_1p}}{C_1 p-r} (k+1)^{-r}. \end{aligned}$$

The final inequality is where we needed \(C_1p > r\), in order to have something better than \(j^{-1}\) in the sum. \(\square\)
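As a quick sanity check, the two bounds are also easy to verify numerically. The following Python sketch evaluates both sides of (i) and (ii) for a few values of k; the parameter values are arbitrary choices satisfying \(C_1 p > r\) and \(4C_2 \ge C_1^2\), and are not taken from the analysis above.

```python
import numpy as np

# Numerical sanity check of Lemma 10; C1, C2, p, r are arbitrary example
# values satisfying C1*p > r and 4*C2 >= C1**2.
C1, C2, p, r = 3.0, 3.0, 0.5, 1.0

for k in (10, 100, 1000):
    j = np.arange(1, k + 1, dtype=float)
    factors = (1.0 - C1 / j + C2 / j**2) ** p
    lhs_i = np.prod(factors)
    rhs_i = np.exp(C2 * p * np.pi**2 / 6) * (k + 1) ** (-C1 * p)
    # tail[j-1] = prod_{i=j+1}^k factors_i, with the empty product equal to 1
    tail = np.append(np.cumprod(factors[::-1])[::-1][1:], 1.0)
    lhs_ii = np.sum(j ** (-(1 + r)) * tail)
    rhs_ii = 2 ** (C1 * p) * np.exp(C2 * p * np.pi**2 / 6) / (C1 * p - r) * (k + 1) ** (-r)
    assert lhs_i <= rhs_i and lhs_ii <= rhs_ii
    print(f"k={k}: (i) {lhs_i:.2e} <= {rhs_i:.2e}, (ii) {lhs_ii:.2e} <= {rhs_ii:.2e}")
```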

4 Proof of main theorem

Using the lemmas presented in the previous section, we are now in a position to prove Theorem 1. Compared to earlier results in the literature, we provide a more general result with respect to the Lipschitz condition. More precisely, with the help of the a priori bound from Lemma 2, we can replace the global Lipschitz condition with a local one.

Proof of Theorem 1

Given the sequence of mutually independent random variables \(\xi ^k\), we abbreviate the random functions \(f_k = f(\cdot , \xi ^k)\) and \(T_{k} = T_{\alpha _k f, \xi^k}\), \(k \in \mathbb {N}\). Then the scheme can be written as \(w^{k+1} = T_k w^{k}\). If \(T_k w^* = w^*\) held, we would essentially only have to invoke Lemmas 8 and 9 to finish the proof. Due to the stochasticity, however, this is not the case, and we need to be more careful.

We begin by adding and subtracting the term \(T_k w^*\) and find that

$$\begin{aligned} \Vert w^{k+1} - w^* \Vert ^2&= \Vert T_k w^{k} - T_k w^* \Vert ^2 + 2( T_k w^{k} - T_k w^* , T_k w^* - w^* )_{} \\&\quad + \Vert T_k w^* - w^* \Vert ^2. \end{aligned}$$

By Lemmas 8 and 9 the expectation \(\mathbf {E}_{\xi ^k}\) of the first term on the right-hand side is bounded by \((1 - 2\mu \alpha _k + 3\nu ^2\alpha _k^2)^{\nicefrac {1}{2}}\Vert w^{k} - w^* \Vert ^2\) while by Lemma 6 the last term is bounded in expectation by \(\alpha _k^2 \sigma ^2\). The second term is the problematic one. We add and subtract both \(w^{k}\) and \(w^*\) in order to find terms that we can control:

$$\begin{aligned}&( T_k w^{k} - T_k w^* , T_k w^* - w^* )_{} \\&\quad = \left((T_k - I) w^{k} - (T_k - I)w^*,(T_k - I)w^*\right)_{} + \left( w^{k} - w^*,(T_k - I)w^*\right)_{} \\&\quad =: I_1 + I_2. \end{aligned}$$

In order to bound \(I_1\) and \(I_2\), we first need to apply the a priori bound from Lemma 2, which will also enable us to utilize the local Lipschitz condition. By Lemma 6, we find that

$$\begin{aligned} \left(\mathbf {E}_{\xi ^k} \left[\Vert T_kw^*\Vert ^j\right] \right)^{\frac{1}{j}} \le \Vert w^*\Vert + \left(\mathbf {E}_{\xi ^k} \left[ \Vert \nabla f_k(w^*)\Vert _{H^*}^j\right] \right)^{\frac{1}{j}} \le \Vert w^*\Vert + \sigma \end{aligned}$$

is bounded for \(j \le 2^m\). As \(T_k\) is a contraction, we also obtain

$$\begin{aligned} \left(\mathbf {E}_k \left[\Vert T_kw^{k}\Vert ^j\right] \right)^{\frac{1}{j}}&\le \left(\mathbf {E}_k \left[\Vert T_kw^{k} - T_kw^*\Vert ^j\right] \right)^{\frac{1}{j}} + \left(\mathbf {E}_{\xi ^k} \left[\Vert T_kw^*\Vert ^j\right] \right)^{\frac{1}{j}}\\&\le \left(\mathbf {E}_{k} \left[\Vert w^{k} - w^*\Vert ^j\right] \right)^{\frac{1}{j}} + \Vert w^*\Vert + \sigma . \end{aligned}$$

Thus, there exists a random variable \(R_1\) such that

$$\begin{aligned} \max \left( \Vert T_k w^{k}\Vert , \Vert T_k w^*\Vert \right) \le R_1, \end{aligned}$$

and \(\mathbf {E}_k [R_1^j]\) is bounded for \(j \le 2^m\). For \(I_1\), we then obtain that

$$\begin{aligned} I_1&\le \left((T_k - I) w^{k} - (T_k - I) w^* ,(T_k - I)w^*\right)_{} \\&\le \Vert \alpha _k \nabla f_k(T_kw^{k}) - \alpha _k\nabla f_k(T_kw^*)\Vert _{H^*} \Vert \alpha _k \nabla f_k(w^*) \Vert _{H^*}\\&\le \alpha _k^2 L_{\xi ^k}(R_1) \Vert T_kw^{k} - T_kw^*\Vert \Vert \nabla f_k(w^*) \Vert _{H^*}\\&\le \alpha _k^2 L_{\xi ^k}(R_1) \Vert w^{k} - w^*\Vert \Vert \nabla f_k(w^*) \Vert _{H^*}, \end{aligned}$$

where we used the fact that \(T_k\) is contractive in the last step. Taking the expectation, we then have by Hölder’s inequality that

$$\begin{aligned} \mathbf {E}_k [I_1 ]&\le \alpha _k^2 \mathbf {E}_k \left[L_{\xi ^k}(R_1) \Vert w^{k} - w^*\Vert \Vert \nabla f_k(w^*) \Vert _{H^*}\right]\\&\le \alpha _k^2 \tilde{L}_1 \left(\mathbf {E}_{k-1} \left[\Vert w^{k} - w^*\Vert ^{2^m} \right]\right)^{2^{-m}} \left(\mathbf {E}_{\xi^k} \left[\Vert \nabla f_k(w^*) \Vert _{H^*}^{2^m} \right]\right)^{2^{-m}}, \end{aligned}$$

where

$$\begin{aligned} \tilde{L}_1 = {\left\{ \begin{array}{ll} \left(\mathbf {E}_k \left[P(R_1)^{{\frac{2^{m}}{2^{m} -2}}} \right]\right)^{{\frac{2^{m} -2}{2^{m}}}}, \quad &{}m > 1,\\ \sup |P(R_1) |, &{}m = 1. \end{array}\right. } \end{aligned}$$

As P is a polynomial of degree at most \(2^m -2\), the expression only contains terms \(R_1^j\) where the exponent j is at most \(\left({\frac{2^{m}}{2^{m} -2}}\right) \left(2^m -2\right) = 2^{m}\). Hence \(\tilde{L}_1\) is bounded, and in view of Lemma 2 we get that

$$\begin{aligned} \mathbf {E}_k[I_1] \le D_1 \alpha _k^2, \end{aligned}$$

where \(D_1 \ge 0\) is a constant depending only on \(\Vert w^*\Vert\), \(\Vert w_1 - w^*\Vert\), \(\sigma\) and \(\eta\). For \(I_2\), we add and subtract \(\alpha _k \iota \nabla f_k (w^*)\) to get

$$\begin{aligned} I_2&= \left( w^{k} - w^*,(T_k - I)w^*\right)_{} \\&= \left( w^{k} - w^*,(T_k - I)w^* + \alpha _k \iota \nabla f_k (w^*) \right)_{} - \left( w^{k} - w^*,\alpha _k \iota \nabla f_k (w^*)\right)_{} . \end{aligned}$$

Since \(w^{k} - w^*\) is independent of \(\alpha _k \iota \nabla f_k (w^*)\) and \(\mathbf {E}_{\xi ^k} [\nabla f_k (w^*)] = \nabla F(w^*) = 0\), it follows that

$$\begin{aligned} \mathbf {E}_{\xi^k} [\left( w^{k} - w^*,\alpha _k \iota \nabla f_k (w^*)\right)_{} ] = \left( w^{k} - w^*, \mathbf {E}_{\xi^k} [\alpha _k \iota \nabla f_k (w^*)]\right)_{} = 0. \end{aligned}$$

Using the Cauchy-Schwarz inequality and Lemma 6, we find that

$$\begin{aligned} \mathbf {E}_k [I_2]&\le \mathbf {E}_k \left[ \Vert w^{k} - w^*\Vert \Vert \iota ^{-1} (T_k - I)w^* + \alpha _k \nabla f_k (w^*)\Vert _{H^*} \right]\\&\le \mathbf {E}_k\left[ L_{\xi ^k}(R_2) \alpha _k^2 \Vert w^{k} - w^*\Vert \Vert \nabla f_k (w^*)\Vert _{H^*} \right] \\&\le \alpha _k^2 \tilde{L}_2 \left(\mathbf {E}_{k-1}\left[\Vert w^{k} - w^*\Vert ^{2^m}\right] \right)^{2^{-m}} \left(\mathbf {E}_{\xi^k} \left[\Vert \nabla f_k(w^*) \Vert _{H^*}^{2^m} \right]\right)^{2^{-m}}, \end{aligned}$$

where \(R_2 = \max (\Vert w^*\Vert ,\Vert \nabla f_k (w^*)\Vert _{H^*})\) and

$$\begin{aligned} \tilde{L}_2 = {\left\{ \begin{array}{ll} \left(\mathbf {E}_k \left[P(R_2)^{{\frac{2^{m}}{2^{m} -2}}} \right]\right)^{{\frac{2^{m} -2}{2^{m}}}}, \quad &{}m > 1,\\ \sup |P(R_2) |, &{}m = 1. \end{array}\right. } \end{aligned}$$

Just as for \(I_1\), we therefore get by Lemma 2 that

$$\begin{aligned} \mathbf {E}_k[I_2] \le D_2 \alpha _k^2, \end{aligned}$$

where \(D_2 \ge 0\) is a constant depending only on \(\Vert w^*\Vert\), \(\Vert w_1 - w^*\Vert\), \(\sigma\) and \(\eta\).

Summarising, we now have

$$\begin{aligned} \mathbf {E}_k\left[ \Vert w^{k+1} - w^* \Vert ^2 \right]&\le \tilde{C}_k \mathbf {E}_{k-1}\left[\Vert w^{k} - w^* \Vert ^2\right] + \alpha _k^2 D \end{aligned}$$

with \(\tilde{C}_k = \left(1 - 2\mu \alpha _k + 3\nu ^2\alpha _k^2 \right)^{\nicefrac {1}{2}}\) and \(D = \sigma ^2 + D_1 + D_2\). Recursively applying the above bound yields

$$\begin{aligned} \mathbf {E}_k\left[ \Vert w^{k+1} - w^* \Vert ^2 \right] \le \prod _{j=1}^k{ \tilde{C}_j} \, \Vert w_1 - w^* \Vert ^2 + D \sum _{j=1}^k{ \alpha _j^2 \prod _{i=j+1}^k{ \tilde{C}_i} }. \end{aligned}$$

Applying Lemma 10 (i) and (ii) with \(p=\nicefrac {1}{2}\), \(r=1\), \(C_1 = 2\mu \eta\) and \(C_2 = 3\nu ^2\eta ^2\) then shows that

$$\begin{aligned} \prod _{j=1}^k{ \tilde{C}_j} \le \mathrm {exp}\left(\frac{\nu ^2\eta ^2 \pi ^2}{4}\right) (k+1)^{-\mu \eta } \end{aligned}$$

and

$$\begin{aligned} \sum _{j=1}^k{ \alpha _j^2 \prod _{i=j+1}^k{ \tilde{C}_i}} \le \eta ^2 2^{\mu \eta } \mathrm {exp}\left(\frac{\nu ^2\eta ^2 \pi ^2}{4}\right) \frac{1}{ \mu \eta - 1} (k+1)^{-1}. \end{aligned}$$

Thus, we finally arrive at

$$\begin{aligned} \mathbf {E}_k\left[ \Vert w^{k+1} - w^*\Vert ^2 \right] \le \frac{C}{k+1}, \end{aligned}$$

where C depends on \(\Vert w^*\Vert\), \(\Vert w_1 - w^*\Vert\), \(\mu\), \(\sigma\) and \(\eta\). \(\square\)

Remark 4

The above proof is complicated mainly due to the stochasticity and the lack of strong convexity. We briefly consider the simpler, deterministic, full-batch case with

$$\begin{aligned} w^{k+1} = w^k - \alpha _k \nabla F(w^{k+1}), \end{aligned}$$

where F is strongly convex with convexity constant \(\mu\). Then it can easily be shown that

$$\begin{aligned} ( \nabla F(v) - \nabla F(w) , v - w )_{} \ge \mu \Vert v - w\Vert ^2. \end{aligned}$$

Together with the Cauchy–Schwarz inequality, this means that

$$\begin{aligned} \Vert \left( I + \alpha \nabla F \right)^{-1}(v) - \left( I + \alpha \nabla F \right)^{-1}(w)\Vert \le (1 + \alpha \mu )^{-1} \Vert v - w\Vert , \end{aligned}$$

i.e. the resolvent is a strict contraction. Since \(\nabla F(w^*) = 0\), it follows that \(\left( I + \alpha \nabla F \right)^{-1} w^* = w^*\) so a simple iterative argument shows that

$$\begin{aligned} \Vert w^{k+1} - w^*\Vert ^2 \le \prod _{j=1}^k \left(1 + \alpha _j \mu \right)^{-1} \Vert w_1 - w^*\Vert ^2. \end{aligned}$$

Using \((1 + \alpha \mu )^{-1} \le 1 - \mu \alpha + \mu ^2\alpha ^2\), choosing \(\alpha _k = \eta /k\) and applying Lemma 10 then shows that

$$\begin{aligned} \Vert w^{k+1} - w^*\Vert ^2 \le C (k+1)^{-1} \end{aligned}$$

for appropriately chosen \(\eta\). In particular, these arguments do not require the Lipschitz continuity of \(\nabla F\), which is needed in the stochastic case to handle the terms arising due to \(\nabla f(w^*, \xi ) \ne 0\).
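As a small illustration of the remark, the following Python sketch runs this deterministic iteration on a toy quadratic \(F(w) = \frac{1}{2} ( Aw , w )\) with a symmetric positive definite matrix A (an arbitrary test problem, unrelated to the experiments in Sect. 5), for which the resolvent is available in closed form and \(w^* = 0\).

```python
import numpy as np

# Deterministic proximal point (implicit Euler) on F(w) = 0.5 * (A w, w),
# where A is an arbitrary symmetric positive definite test matrix, so w* = 0.
rng = np.random.default_rng(1)
Q = rng.standard_normal((20, 20))
A = Q.T @ Q + np.eye(20)                 # strong convexity constant mu >= 1
mu = np.linalg.eigvalsh(A).min()
eta = 2.0 / mu                           # step sizes alpha_k = eta / k, so mu * eta = 2
w = rng.standard_normal(20)

errors = {}
for k in range(1, 10**4 + 1):
    alpha = eta / k
    w = np.linalg.solve(np.eye(20) + alpha * A, w)   # w^{k+1} = (I + alpha_k A)^{-1} w^k
    if k in (10, 100, 1000, 10**4):
        errors[k] = np.linalg.norm(w) ** 2

# The squared errors should decay at least as fast as C/k, cf. Lemma 10 (i).
print(errors)
```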

5 Numerical experiments

In order to illustrate our results, we set up a numerical experiment along the lines given in the introduction. In the following, let \(H = L^2(0,1)\) be the Lebesgue space of square integrable functions equipped with the usual inner product and norm. Further, let \(x_j^1 \in H\), \(j = 1, \dots , \left\lfloor \frac{n}{2}\right\rfloor\), and \(x_j^2 \in H\), \(j = \left\lfloor \frac{n}{2}\right\rfloor + 1, \ldots , n\), be elements from two different classes within the space H. In particular, we choose each \(x_j^1\) to be a polynomial of degree 4 and each \(x_j^2\) to be a trigonometric function with bounded frequency. The polynomial coefficients and the frequencies were chosen randomly.

We want to classify these functions as either polynomial or trigonometric. To do this, we set up an affine (SVM-like) classifier by choosing the loss function \(\ell (h,y)= \ln (1 + \mathrm {e}^{-hy})\) and the prediction function \(h( [w,\overline{w}], x) = ( w , x )_{} + \overline{w}\) with \([w,\overline{w}] \in L^2(0,1) \times \mathbb {R}\). Without \(\overline{w}\), this would be linear, but by including \(\overline{w}\) we can allow for a constant bias term and thereby make it affine. We also add a regularization term \(\frac{\lambda }{2} \Vert w\Vert ^2\) (not including the bias), such that the minimization objective is

$$\begin{aligned} F([w,\overline{w}]) = \frac{1}{n} \sum _{j = 1}^{n}{\ell (h( [w,\overline{w}], x_j), y_j)} + \frac{\lambda }{2} \Vert w\Vert ^2 , \end{aligned}$$

where \([x_j, y_j] = [x^1_j, -1]\) if \(j \le \left\lfloor \frac{n}{2} \right\rfloor\) and \([x_j, y_j] = [x^2_j, 1]\) if \(j > \left\lfloor \frac{n}{2}\right\rfloor\), similar to Eq. (2). In one step of SPI, we use the function

$$\begin{aligned} f([w,\overline{w}], \xi ) = \ell (h( [w,\overline{w}], x_{\xi }), y_{\xi }) + \frac{\lambda }{2} \Vert w\Vert ^2 , \end{aligned}$$

with a random variable \(\xi :\Omega \rightarrow \{1,\dots ,n\}\). Since we cannot do computations directly in the infinite-dimensional space, we discretize all the functions using N equidistant points in [0, 1], omitting the endpoints. For each N, this gives us an optimization problem on \(\mathbb {R}^N\), which approximates the problem on H.
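To make the setup concrete, one possible way to generate such a discretized data set is sketched below. The coefficient range, the frequency bound and the use of a pure sine are illustrative choices only, not necessarily those used in the experiments reported below.

```python
import numpy as np

def make_data(n=1000, N=200, seed=0):
    """Generate n discretized functions on (0,1): the first n//2 are degree-4
    polynomials (label -1), the rest are trigonometric functions with bounded
    frequency (label +1). Parameter ranges are illustrative choices only."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, N + 2)[1:-1]            # N equidistant interior points
    X = np.empty((n, N))
    y = np.empty(n)
    for j in range(n):
        if j < n // 2:
            coeffs = rng.uniform(-1.0, 1.0, size=5)    # degree-4 polynomial
            X[j] = np.polyval(coeffs, t)
            y[j] = -1.0
        else:
            freq = rng.uniform(1.0, 10.0)               # bounded frequency
            X[j] = np.sin(2.0 * np.pi * freq * t)
            y[j] = 1.0
    return t, X, y
```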

For the implementation, we make use of the following computational idea, which makes SPI essentially as fast as SGD. Differentiating the chosen \(\ell\) and h shows that the scheme is given by the iteration

$$\begin{aligned} {[}w,\overline{w}]^{k+1} = [w,\overline{w}]^{k} + c_k [x_k,1] - \lambda \alpha _k [w,0]^{k+1}, \end{aligned}$$

where \(c_k = \frac{\alpha _k y_k }{ 1 + \mathrm {exp}( ( w^{k+1}, \,x_k ) y_k + \overline{w}^{k+1} y_k)}\). This is equivalent to

$$\begin{aligned} w^{k+1} = \frac{1}{1 + \alpha _k \lambda } \left(w^{k} + c_k x_k\right) \quad \text {and} \quad \overline{w}^{k+1} = \overline{w}^{k} + c_k. \end{aligned}$$

Inserting the expression for \([w,\overline{w}]^{k+1}\) in the definition of \(c_k\), we obtain that

$$\begin{aligned} c_k = \frac{\alpha _k y_k}{ 1 + \mathrm {exp}\left( \frac{1}{1 + \alpha _k \lambda } ( w^{k} + c_k x_k , x_k )_{} y_k + (\overline{w}^{k} + c_k) y_k\right) }. \end{aligned}$$

We thus only need to solve one scalar-valued equation. This is at most twice as expensive as SGD, since the equation solving is essentially free and the only additional costly term is \(( x_k , x_k )_{}\) (the term \(( w^k , x_k )_{}\) of course has to be computed also in SGD). By storing the scalar result, the extra cost will be essentially zero if the same sample is revisited. We note that extending this approach to larger batch-sizes is straightforward. If the batch size is B, then one has to solve a B-dimensional equation.
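For illustration, a minimal sketch of one such step, assuming the discretized data are stored as NumPy arrays and using scipy.optimize.brentq as one possible solver for the scalar equation, could look as follows; the function and variable names are our own and not part of the actual implementation.

```python
import numpy as np
from scipy.optimize import brentq

def spi_step(w, wbar, x, y, alpha, lam):
    """One SPI step for the regularized logistic loss, following the scalar
    equation for c_k derived above. All names are illustrative."""
    xx = x @ x                      # (x_k, x_k); can be cached per sample
    wx = w @ x                      # (w^k, x_k); also needed by SGD
    def g(c):                       # the root of g is c_k
        return c - alpha * y / (1.0 + np.exp(
            ((wx + c * xx) / (1.0 + alpha * lam) + wbar + c) * y))
    # Since the logistic factor lies in (0, 1), the root is bracketed by 0 and alpha*y.
    lo, hi = sorted((0.0, alpha * y))
    c = brentq(g, lo, hi)
    w_new = (w + c * x) / (1.0 + alpha * lam)
    wbar_new = wbar + c
    return w_new, wbar_new
```

A driver loop then only needs to draw a random sample index and call this function with \(\alpha _k = \eta /k\); the corresponding SGD step differs essentially only in that \(c_k\) is evaluated explicitly at \([w,\overline{w}]^{k}\) instead of being determined by the scalar equation.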

Using this idea, we implemented the method in Python and tested it on a series of different discretizations. We took \(n = 1000\), i.e. 500 functions of each type, \(M = 10{,}000\) time steps and discretization parameters \(N = 100 \cdot 2^i\) for \(i = 1, \ldots , 11\) to approximate the infinite-dimensional space \(L^2(0,1)\). We used \(\lambda = 10^{-3}\) and the initial step size \(\eta = \nicefrac{2}{\lambda }\); since in this case it can be shown that \(\mu \ge \lambda\), this choice ensures that \(\mu \eta \ge 2 > 1\). There is no closed-form expression for the exact minimum \(w^*\), so instead we ran SPI with 10M time steps and used the resulting reference solution as an approximation to \(w^*\). Further, we approximated the expectation \(\mathbf {E}_k\) by running the experiment 100 times and averaging the resulting errors. In order to compensate for the vectors becoming longer as N increases, we measure the errors in the RMS-norm \(\Vert \cdot \Vert _N = \Vert \cdot \Vert _{\mathbb {R}^N} / \sqrt{N+1}\), which tends to the \(L^2\) norm as \(N \rightarrow \infty\).

Figure 1 shows the resulting approximated errors \(\mathbf {E}_{k-1}[\Vert w^{k} - w^*\Vert _N^2]\). As expected, we observe convergence proportional to \(\nicefrac {1}{k}\) for all N. The error constants vary to a certain extent, but they are reasonably similar, and they vary less as the problem approaches the infinite-dimensional case. In order to decrease the computational requirements, we only compute statistics every 100 time steps; this is why the plot starts at \(k = 100\).

Fig. 1

Approximated errors \(\mathbf {E}_{k-1}[\Vert w^{k} - w^*\Vert _N^2]\) for the SPI method, measured in the RMS-norm, for discretizations with varying numbers of grid points N. Statistics were only computed every 100 time steps; this is why the plot starts at \(k = 100\). The 1/k-convergence is clearly seen by comparing to the uppermost solid black reference line

In contrast, redoing the same experiment with the explicit SGD method results in Fig. 2. We note that except for \(N = 200\) and \(N=400\), the method seemingly does not converge at all. This is partially explained by the fact that the Lipschitz constant grows with N (at least for the coarsest discretizations, for which we could estimate it), so that we get closer to the stability boundary. The main reason, however, is the occurrence of rare “bad” paths, in which the method initially takes a large step in the wrong direction. In theory, it will eventually recover from this; in practice, it does not, due to the finite computational budget. Even when such bad paths are omitted from the results and \(\mathcal {O}(1/k)\)-convergence is observed, the errors are much larger than in Fig. 1, and many more steps would be necessary to reach the same accuracy as SPI. Since our implementations are certainly not optimized, we do not show a comparison of computational times here. They are, however, very similar, meaning that SPI is more efficient than SGD for this problem.

Fig. 2

Approximated errors \(\mathbf {E}_{k-1}[\Vert w^{k} - w^*\Vert _N^2]\) for the SGD method, measured in the RMS-norm, for discretizations with varying numbers of grid points N. Statistics were only computed every 100 time steps; this is why the plot starts at \(k = 100\). Except for \(N = 200\) and \(N=400\), the method does not converge at all. Even when it does, the errors are much larger than in Fig. 1

6 Conclusions

We have rigorously proved convergence with an optimal rate for the stochastic proximal iteration method in a general Hilbert space. This improves on the existing analyses in two ways. First, it extends similar results from the finite-dimensional setting to the infinite-dimensional case, as well as to more general operators. Second, it improves on similar infinite-dimensional results that only establish convergence, without any error bounds. The latter improvement comes at the cost of stronger assumptions on the cost functional. Global Lipschitz continuity of the gradient is, admittedly, a rather strong assumption. However, as we have demonstrated, it can be replaced by the weaker condition of local Lipschitz continuity, where the maximal growth of the Lipschitz constant is determined by higher moments of the gradient evaluated at the minimum. Finally, we have seen that the theoretical results are also applicable in practice, as demonstrated by the numerical results in the previous section.