# Sub-linear convergence of a stochastic proximal iteration method in Hilbert space

## Abstract

We consider a stochastic version of the proximal point algorithm for convex optimization problems posed on a Hilbert space. A typical application of this is supervised learning. While the method is not new, it has not been extensively analyzed in this form. Indeed, most related results are confined to the finite-dimensional setting, where error bounds could depend on the dimension of the space. On the other hand, the few existing results in the infinite-dimensional setting only prove very weak types of convergence, owing to weak assumptions on the problem. In particular, there are no results that show strong convergence with a rate. In this article, we bridge these two worlds by assuming more regularity of the optimization problem, which allows us to prove convergence with an (optimal) sub-linear rate also in an infinite-dimensional setting. In particular, we assume that the objective function is the expected value of a family of convex differentiable functions. While we require that the full objective function is strongly convex, we do not assume that its constituent parts are so. Further, we require that the gradient satisfies a weak local Lipschitz continuity property, where the Lipschitz constant may grow polynomially given certain guarantees on the variance and higher moments near the minimum. We illustrate these results by discretizing a concrete infinite-dimensional classification problem with varying degrees of accuracy.

## Introduction

We consider convex optimization problems of the form

\begin{aligned} w^* = {\mathrm{arg min}}_{w \in H} F(w), \end{aligned}
(1)

where H is a real Hilbert space and

\begin{aligned} F(w) = \mathbf {E}_{\xi } [ f(w, \xi ) ]. \end{aligned}

The main applications we have in mind are supervised learning tasks. In such a problem, a set of data samples $$\{x_j\}_{j=1}^{n}$$ with corresponding labels $$\{y_j\}_{j=1}^{n}$$ is given, as well as a classifier h depending on the parameters w. The goal is to find w such that $$h(w,x_j) \approx y_j$$ for all $$j \in \{1,\dots ,n\}$$. This is done by minimizing

\begin{aligned} F(w) = \frac{1}{n} \sum _{j = 1}^{n}{\ell (h(w, x_j), y_j)}, \end{aligned}
(2)

where $$\ell$$ is a given loss function. We refer to, e.g., Bottou et al.  for an overview. In order to reduce the computational cost, it has proved useful to split F into a collection of functions f of the type

\begin{aligned} f(w,\xi ) = \frac{1}{|B_{\xi } |} \sum _{j \in B_{\xi } } {\ell (h(w, x_j), y_j)}, \end{aligned}

where $$B_{\xi }$$ is a random subset of $$\{1,\dots ,n\}$$, referred to as a batch. In particular, the case of $$|B_{\xi } |= 1$$ is interesting for applications, as it corresponds to a separation of the data into single samples.
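As a concrete illustration of this batching, the following sketch builds F and its batch approximations $$f(\cdot, \xi)$$ for a hypothetical one-dimensional least-squares task; the data, loss and classifier below are invented for illustration only and are not part of the paper's setting.

```python
import random

# Hypothetical data: n samples x_j with labels y_j = 2 * x_j (illustration only).
n = 100
xs = [j / n for j in range(n)]
ys = [2.0 * x for x in xs]

def loss(prediction, label):
    # Squared loss ell(h, y) = (h - y)^2, a simple convex choice.
    return (prediction - label) ** 2

def h(w, x):
    # Linear classifier h(w, x) = w * x.
    return w * x

def F(w):
    # Full objective: average loss over all n samples, cf. (2).
    return sum(loss(h(w, xs[j]), ys[j]) for j in range(n)) / n

def f(w, batch):
    # Stochastic approximation f(w, xi): average loss over a random batch B_xi.
    return sum(loss(h(w, xs[j]), ys[j]) for j in batch) / len(batch)

batch = random.sample(range(n), 10)  # a batch with |B_xi| = 10
```

Evaluating f on a batch costs a tenth of a full evaluation of F here, which is the point of the splitting.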

A commonly used method for such problems is the stochastic gradient method (SGD), given by the iteration

\begin{aligned} w^{k+1} = w^k - \alpha _k \nabla f(w^k, \xi ^k), \end{aligned}

where $$\alpha _k >0$$ denotes a step size, $$\{\xi ^k\}_{k \in \mathbb {N}}$$ is a family of jointly independent random variables and $$\nabla$$ denotes the Gâteaux derivative with respect to the first variable. The idea is that in each step we choose a random part $$f(\cdot , \xi )$$ of F and go in the direction of the negative gradient of this function. SGD corresponds to a stochastic version of the explicit (forward) Euler scheme applied to the gradient flow

\begin{aligned} \dot{w} = - \nabla F(w). \end{aligned}

This differential equation is frequently stiff, which means that the method often suffers from stability issues.

The restatement of the problem as a gradient flow suggests that we could avoid such stability problems by instead considering a stochastic version of implicit (backward) Euler, given by

\begin{aligned} w^{k+1} = w^k - \alpha _k \nabla f(w^{k+1}, \xi ^k). \end{aligned}
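The stability difference between the two discretizations can be seen already on a deterministic quadratic $$F(w) = \frac{\lambda}{2} w^2$$, for which both Euler steps have closed forms; the values of $$\lambda$$ and $$\alpha$$ below are assumed, chosen only to exhibit the stiff regime $$\alpha\lambda > 2$$.

```python
# Stability sketch on the stiff quadratic F(w) = (lam / 2) * w**2, where
# grad F(w) = lam * w. Explicit Euler gives w <- (1 - alpha * lam) * w, which
# diverges when alpha * lam > 2; implicit Euler gives w <- w / (1 + alpha * lam),
# which contracts for every alpha > 0.
lam = 100.0   # stiffness parameter (assumed for illustration)
alpha = 0.1   # step size with alpha * lam = 10 > 2

def explicit_step(w):
    return w - alpha * lam * w

def implicit_step(w):
    # Solve w_next = w - alpha * lam * w_next in closed form.
    return w / (1.0 + alpha * lam)

w_exp, w_imp = 1.0, 1.0
for _ in range(20):
    w_exp = explicit_step(w_exp)
    w_imp = implicit_step(w_imp)
```

After 20 steps the explicit iterate has blown up by many orders of magnitude, while the implicit iterate has decayed towards the minimum at 0.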

In the deterministic setting, this method has a long history under the name proximal point method, because it is equivalent to

\begin{aligned} w^{k+1} = {\mathrm{arg min}}_{w \in H} \left\{ \alpha F(w) + \frac{1}{2} \Vert w - w^k\Vert ^2 \right\} = \text {prox}_{\alpha F}(w^k), \end{aligned}

where

\begin{aligned} \text {prox}_{\alpha F}(w^k) = (I + \alpha \nabla F)^{-1} w^k. \end{aligned}
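For a simple quadratic, the proximal operator can be written down explicitly, and one can check that it indeed minimizes the proximal objective; the concrete F below is an assumed toy example, not the paper's general setting.

```python
def prox_quad(v, alpha):
    # For F(w) = w**2 / 2 we have grad F(w) = w, so the resolvent
    # (I + alpha * grad F)^{-1} applied to v is simply v / (1 + alpha).
    return v / (1.0 + alpha)

def prox_objective(w, v, alpha):
    # The objective alpha * F(w) + (1/2) * |w - v|^2 from the prox definition.
    return alpha * 0.5 * w ** 2 + 0.5 * (w - v) ** 2

v, alpha = 3.0, 2.0
p = prox_quad(v, alpha)  # closed-form proximal point
```

Perturbing the proximal point in either direction increases the proximal objective, consistent with it being the unique minimizer of a strongly convex function.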

The proximal point method has been studied extensively in the infinite dimensional but deterministic case, beginning with the work of Rockafellar . Several convergence results and connections to other methods such as the Douglas–Rachford splitting are collected in Eckstein and Bertsekas , see also Güler . In the strongly convex case, the main convergence analysis idea is to observe that the gradient is strongly monotone. Then the resolvent $$(I + \alpha \nabla F)^{-1}$$ is a strict contraction, and the Banach fixed point theorem shows that $$\{w^k\}_{k \in \mathbb {N}}$$ converges to $$w^*$$ in norm.

Following Ryu and Boyd , we will refer to the stochastic version as stochastic proximal iteration (SPI). We note that the computational cost of one SPI step is in general much higher than for SGD, and indeed often infeasible. However, in many special cases a clever reformulation can result in very similar costs. If so, then SPI should be preferred over SGD, as it will converge more reliably. We provide such an example in Sect. 5.

The main goal of this paper is to prove sub-linear convergence of the type

\begin{aligned} \mathbf {E}\left[ \Vert w^k - w^* \Vert ^2 \right] \le \frac{C}{k} \end{aligned}

in an infinite-dimensional setting, i.e. where $$\{w^k\}_{k \in \mathbb {N}}$$ and $$w^*$$ are elements in a Hilbert space H. As shown in e.g. [1, 26], this is optimal in the sense that we cannot expect a better asymptotic rate even in the finite-dimensional case.

Most previous convergence results in this setting only provide guarantees for convergence, without an explicit error bound. The convergence is usually also in a rather weak norm. This is mainly due to weak assumptions on the involved functions and operators. Overall, little work has been done on SPI in an infinite-dimensional space. A few exceptions are given by Bianchi , where maximal monotone operators $$\nabla F :H \rightarrow 2^H$$ are considered and weak ergodic convergence and norm convergence are proved. In Rosasco et al. , the authors work in an infinite-dimensional setting with an implicit-explicit splitting, where $$\nabla F$$ is decomposed into a regular and an irregular part. The regular part is treated explicitly but with a stochastic approximation, while the irregular part is used in a deterministic proximal step. They prove both $$\nabla F(w^k) \rightarrow \nabla F(w^*)$$ and $$w^k \rightarrow w^*$$ in H as $$k \rightarrow \infty$$. Without further assumptions, neither of these approaches yields convergence rates.

In the finite-dimensional case, stronger assumptions are typically made, with better convergence guarantees as a result. Nevertheless, for the SPI scheme in particular, we are only aware of the unpublished manuscript , which suggests $$\nicefrac {1}{k}$$ convergence in $$\mathbb {R}^d$$. Based on , the implicit method has also been considered in a few other works: In Patrascu and Necoara , an SPI method with additional constraints on the domain was studied. A slightly more general setting that includes the SPI was considered in Davis and Drusvyatskiy . Toulis, Airoldi and coauthors studied such implicit schemes in [35,36,37]. Finally, very recently and during the preparation of this work,  was published, wherein both SGD and proximal methods for composite problems are analyzed in a common framework based on bounded gradients. This is a generalization of the basic setting in a different direction than our work.

Whenever using an implicit scheme, it is essential to solve the arising implicit equation efficiently. This can be impeded by large batches for the stochastic approximation of F. On the other hand, a larger batch improves the accuracy of the approximation of the function. In Toulis et al. [39, 40] and Ryu and Yin , a compromise was found by solving several implicit problems on small batches and taking the average of these results. This corresponds to a sum splitting. Furthermore, implicit-explicit splittings can be found in Patrascu and Irofti , Ryu and Yin , Salim et al. , Bianchi and Hachem  and Bertsekas . A few more related schemes have been considered in Asi and Duchi [2, 3] and Toulis et al. . More information about the complexity of solving these kinds of implicit equations and the corresponding implementation can be found in Fagan and Iyengar  and Tran et al. .

Our aim is to bridge the gap between the “strong finite-dimensional” and “weak infinite-dimensional” settings, by extending the approach of  to the infinite-dimensional case. We also further extend the results by allowing for more general Lipschitz conditions on $$\nabla f(\cdot ,\xi )$$, provided that sufficient guarantees can be made on the integrability near the minimum $$w^*$$. In particular, we make the less restrictive assumption that for every function $$f(\cdot , \xi )$$ and every ball of radius $$R>0$$ around the origin there is a Lipschitz constant $$L_{\xi }(R)$$ that grows polynomially with R. We also weaken the standard assumption of strong convexity and only demand that the functions are strongly convex for some realizations.

We note that if F is only convex then there might be multiple local minima, and proving convergence in norm is in general not possible. On the other hand, if every $$f(\cdot , \xi )$$ is strongly convex then parts of the analysis can be simplified. The assumptions made in this article are thus situated between these two extremes, where it is still possible to prove convergence results similar to the strongly convex case but under milder assumptions.

These strong convergence results can then be applied to, e.g., the setting where there is an original infinite-dimensional optimization problem which is subsequently discretized into a series of finite-dimensional problems. Given a reasonable discretization, each of those problems will then satisfy the same convergence guarantees.

Our analysis closely follows the finite-dimensional approach . However, several arguments no longer work in the infinite-dimensional case (such as the unit ball being compact, or a linear operator having a minimal eigenvalue) and we fix those. Additionally, we simplify several of the remaining arguments, provide many omitted, but critical, details and extend the results to more general operators.

A brief outline of the paper is as follows. The main assumptions that we make are stated in Sect. 2, as well as the main theorem. Then we prove a number of preliminary results in Sect. 3, before we can tackle the main proof in Sect. 4. In Sect. 5 we describe a numerical experiment that illustrates our results, and then we summarize our findings in Sect. 6.

## Assumptions and main theorem

Let $$(\Omega , \mathcal {F}, \mathbf {P})$$ be a complete probability space and let $$\{\xi ^k\}_{k \in \mathbb {N}}$$ be a family of jointly independent random variables on $$\Omega$$. Each realization of $$\xi ^k$$ corresponds to a different batch. Let $$(H, ( \cdot , \cdot )_{}, \Vert \cdot \Vert )$$ be a real Hilbert space and $$(H^*, ( \cdot , \cdot )_{H^*}, \Vert \cdot \Vert _{H^*} )$$ its dual. Since H is a Hilbert space, there exists an isometric isomorphism $$\iota :H^* \rightarrow H$$ whose inverse $$\iota ^{-1} :H \rightarrow H^*$$ is given by $$\iota ^{-1}: u \mapsto ( u , \cdot )_{}$$. Furthermore, the dual pairing is denoted by $$\langle u' , u\rangle _{} = u'(u)$$ for $$u' \in H^*$$ and $$u \in H$$. It satisfies

\begin{aligned} \langle \iota ^{-1} u , v\rangle _{} = ( u , v )_{} \quad \text {and} \quad \langle u' , v\rangle _{} = ( \iota u' , v )_{}, \quad u,v \in H, u' \in H^*. \end{aligned}

We denote the space of linear bounded operators mapping H into H by $$\mathcal {L}(H)$$. For a symmetric operator S, we say that it is positive if $$( Su , u )_{} \ge 0$$ for all $$u \in H$$. It is called strictly positive if $$( Su , u )_{} > 0$$ for all $$u \in H$$ such that $$u \ne 0$$.

For the function $$f(\cdot , \xi ) :H \times \Omega \rightarrow (-\infty , \infty ]$$, we use $$\nabla$$, as in $$\nabla f(u, \xi )$$, to denote differentiation with respect to the first variable. When we present an argument that holds almost surely, we will frequently omit $$\xi$$ from the notation and simply write f(u) rather than $$f(u, \xi )$$. Given a random variable X on $$\Omega$$, we denote the expectation with respect to $$\mathbf {P}$$ by $$\mathbf {E}[X]$$. We use sub-indices, such as in $$\mathbf {E}_{\xi }[\cdot ]$$, to denote expectations with respect to the probability distribution of the random variable $$\xi$$.

We consider the stochastic proximal iteration (SPI) scheme given by

\begin{aligned} w^{k+1} = w^k - \alpha _k \iota \nabla f(w^{k+1}, \xi ^k) \quad \text { in } H, \quad \quad w^1 = w_1 \quad \text { in } H, \end{aligned}
(3)

for minimizing

\begin{aligned} F(w) = \mathbf {E}_{\xi } [ f(w, \xi ) ], \end{aligned}

where f and F fulfill the following assumption.

For the family of jointly independent random variables $$\{\xi ^k\}_{k \in \mathbb {N}}$$, we are interested in the total expectation

\begin{aligned} \mathbf {E}_{k}\left[ \Vert X\Vert ^2 \right] := \mathbf {E}_{\xi ^1}\left[ \mathbf {E}_{\xi ^2}\left[ \cdots \mathbf {E}_{\xi ^{k}} \left[ \Vert X \Vert ^2 \right] \cdots \right]\right]. \end{aligned}

Since the random variables $$\{\xi ^k\}_{k \in \mathbb {N}}$$ are jointly independent, and $$w^k$$ only depends on $$\xi ^j$$, $$j \le k-1$$, this expectation coincides with the expectation with respect to the joint probability distribution of $$\xi ^1, \ldots , \xi ^{k-1}$$. In the rest of the paper, it often occurs that a statement does not involve an expectation but contains a random variable. Where it does not cause any confusion, such a statement is assumed to hold almost surely even if this is not explicitly stated.

### Assumption 1

For a random variable $$\xi$$ on $$\Omega$$, let the function $$f(\cdot , \xi ) :H \times \Omega \rightarrow (-\infty , \infty ]$$ be given such that $$\omega \mapsto f(v, \xi (\omega ))$$ is measurable for every $$v \in H$$ and such that $$f(\cdot , \xi )$$ is convex, lower semi-continuous and proper almost surely. Additionally, $$f(\cdot , \xi )$$ fulfills the following conditions:

• The expectation $$\mathbf {E}_{\xi }\left[f(\cdot , \xi )\right] =: F(\cdot )$$ is lower semi-continuous and proper.

• The function $$f(\cdot , \xi )$$ is Gâteaux differentiable almost surely on a non-empty common domain $$\mathcal {D}\left( \nabla f \right) \subseteq H$$, i.e. for all $$v,w \in \mathcal {D}\left( \nabla f \right)$$ the identity $$\langle \nabla f (v, \xi ) , w\rangle _{} = \lim _{h \rightarrow 0} \frac{f(v + hw, \xi ) - f(v, \xi )}{h}$$ holds almost surely.

• There exists $$m \in \mathbb {N}$$ such that $$\left(\mathbf{E}_{\xi }\left[ \Vert \nabla f(w^*,\xi )\Vert _{H^*}^{2^m} \right]\right)^{2^{-m}} =: \sigma < \infty$$.

• For every $$R > 0$$ there exists $$L_{\xi }(R) :\Omega \rightarrow \mathbb {R}$$ such that

\begin{aligned} \Vert \nabla f(u, \xi ) - \nabla f(v, \xi ) \Vert _{H^*} \le L_{\xi }(R) \Vert u - v\Vert \end{aligned}

almost surely for all $$u, v \in \mathcal {D}\left( \nabla f \right)$$ with $$\Vert u \Vert , \Vert v\Vert \le R$$. Furthermore, there exists a polynomial $$P :\mathbb {R}\rightarrow \mathbb {R}$$ of degree $$2^m -2$$ such that $$L_{\xi }(R) \le P(R)$$ almost surely.

• There exist a random variable $$M_{\xi } :\Omega \rightarrow \mathcal {L}(H)$$ whose values are symmetric operators and a random variable $$\mu _{\xi } :\Omega \rightarrow [0, \infty )$$ such that $$\mathbf {E}_{\xi }[\mu _{\xi }] = \mu > 0$$ and $$\mathbf {E}_{\xi }[\mu _{\xi }^2] = \nu ^2 < \infty$$. Moreover,

\begin{aligned} \langle \nabla f(u, \xi ) - \nabla f(v, \xi ) , u - v\rangle _{} \ge ( M_{\xi }(u - v) , u - v )_{} \ge \mu _{\xi } \Vert u - v\Vert ^2 \end{aligned}

is fulfilled almost surely for all $$u,v \in \mathcal {D}\left( \nabla f \right)$$.

An immediate consequence of Assumption 1 is that the gradient $$\nabla f(\cdot , \xi )$$ is maximal monotone almost surely, see [27, Theorem A]. Hence, the resolvent (proximal operator)

\begin{aligned} T_{f, \xi } = (I + \nabla f(\cdot ,\xi ))^{-1} \end{aligned}

is well-defined almost surely, see Lemma 1 for more details. Further, each resolvent maps into $$\mathcal {D}\left( \nabla f \right)$$, and as a consequence every iterate $$w^k$$ lies in $$\mathcal {D}\left( \nabla f \right)$$. Finally, we may interchange expectation and differentiation, so that $$\nabla F(w) = \mathbf {E}_{\xi }[\nabla f(w, \xi )]$$. Note that this means that the approximation $$\nabla f(\cdot , \xi )$$ is an unbiased estimate of the full gradient $$\nabla F$$. In our case, this property can be shown via a straightforward argument based on dominated convergence similar to [32, Lemma 6], but we note that it also holds in more general settings [21, 29].

### Remark 1

The idea behind the operators $$M_{\xi }$$ is that each $$f(\cdot , \xi )$$ is allowed to be only convex rather than strongly convex. However, $$f(\cdot , \xi )$$ should be strongly convex for some realizations, such that it is strongly convex in expectation. By assumption, F is lower semi-continuous, proper and strongly convex, so there is a minimum $$w^*$$ of (1) (cf. [4, Proposition 1.4]) which is unique due to the strong convexity.

### Remark 2

Note that the local Lipschitz constant of Assumption 1 is a generalization compared to  and other existing literature. Instead of asking for one Lipschitz constant $$L_{\xi }$$ that is valid on the entire domain, we only ask for a Lipschitz constant $$L_{\xi }(R)$$ that depends on the norm of the input elements $$u, v \in \mathcal {D}(\nabla f)$$. This means in particular that $$L_{\xi }(R)$$ may tend to infinity as $$R \rightarrow \infty$$. In the coming analysis we handle this by applying an a priori bound (Lemma 2) that shows that the solution is bounded and thus R is bounded too.

While the properness of F needs to be verified by application-specific means, the lower semi-continuity can be guaranteed on a more general level in different ways. If, e.g., it is additionally known that $$\mathbf {E}_{\xi } \left[\inf _{u \in H} f(u, \xi ) \right] > -\infty$$ then one can employ Fatou’s lemma ([22, Theorem 2.3.6]) as in [32, Lemma 5], or slightly modify [5, Corollary 9.4].

We note that from a functional analytic point of view, we are dealing with bounded rather than unbounded operators $$\nabla F$$. However, operators that are traditionally seen as unbounded also fit into the framework, given that the space H is chosen properly. For example, the functional $$F(w) = \frac{1}{2}\int {\Vert \nabla w\Vert ^2}$$ corresponding to $$\nabla F = - \Delta$$, the negative Laplacian, is unbounded on $$H = L^2$$. But if we instead choose $$H = H^1_0$$, then $$H^* = H^{-1}$$ and $$\nabla F$$ is bounded and Lipschitz continuous. In this case, the splitting of F(w) into $$f(w, \xi ^k)$$ is less obvious than in our main application, but e.g. (randomized) domain decomposition as in  is a natural idea. In each step, an elliptic problem then has to be solved (to apply $$\iota$$), but this can often be done very efficiently.

Our main theorem states that we have sub-linear convergence of the iterates $$w^k$$ to $$w^*$$ in expectation:

### Theorem 1

Let Assumption 1 be fulfilled and let $$\{\xi ^k\}_{k \in \mathbb {N}}$$ be a family of jointly independent random variables on $$\Omega$$. Then the scheme (3) converges sub-linearly if the step sizes fulfill $$\alpha _k = \frac{\eta }{k}$$ with $$\eta > \frac{1}{\mu }$$. In particular, the error bound

\begin{aligned} \mathbf {E}_{k-1}\left[ \Vert w^k - w^* \Vert ^2 \right] \le \frac{C}{k} \end{aligned}

is fulfilled, where C depends on $$\Vert w_1 - w^* \Vert$$, $$\mu$$, $$\nu$$, $$\sigma$$, $$\eta$$ and m.

When $$m=1$$, there is an L such that $$L_{\xi }(R) \le L$$ almost surely for all R, and we have the explicit bound

\begin{aligned} C = \left( \Vert w^1 - w^* \Vert ^2 + \frac{2^{\mu \eta }\eta ^2 }{\mu \eta -1}\left(\sigma ^2 + 2L\sigma \left(\Vert w^1 - w^* \Vert ^2 + \sigma ^2 \sum _{j=1}^{k-1}{\alpha _j^2}\right)^{\frac{1}{2}} \right) \right)\mathrm {exp}\left(\frac{\nu ^2\eta ^2 \pi ^2}{4}\right) . \end{aligned}

For details on the error constant when $$m > 1$$, we refer the reader to the proof, which is given in Sect. 4. We note that there is no upper bound on the step size $$\alpha _k$$, as would be the case for an explicit method like SGD. There is still a lower bound, but this is not as critical. Similarly to the finite-dimensional case (see e.g. [32, Theorem 15]), the method still converges if the assumption $$\eta > \frac{1}{\mu }$$ is not fulfilled, albeit at a slower rate $$\mathcal {O}(1/k^\gamma )$$ with $$\gamma < 1$$. This follows from a straightforward extension of Lemma 10 and the above theorem, but we omit these details for brevity. Moreover, we note that the exponential terms in the error constant are an artifact of the proof. They are not observed in practice and could likely be removed by the use of more refined algebraic inequalities.
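The step size choice $$\alpha_k = \eta/k$$ with $$\eta > 1/\mu$$ is easy to exercise on a toy problem where the implicit equation has a closed-form solution; the specific $$f(w, \xi) = \frac{1}{2}(w - \xi)^2$$ with zero-mean noise below is an assumed example (with $$\mu = 1$$, $$w^* = 0$$), not a case treated in the paper.

```python
import random

random.seed(0)

# SPI sketch for f(w, xi) = (w - xi)**2 / 2 with E[xi] = 0, so that
# F(w) = E[f(w, xi)] is strongly convex with mu = 1 and minimizer w* = 0.
# The implicit equation w_new = w - alpha * (w_new - xi) solves exactly to
# w_new = (w + alpha * xi) / (1 + alpha).
eta = 2.0        # step size scale with eta > 1/mu = 1, as in Theorem 1
w = 10.0         # initial guess w^1 (arbitrary)
K = 10000
for k in range(1, K + 1):
    alpha = eta / k
    xi = random.gauss(0.0, 0.1)  # noise level sigma = 0.1
    w = (w + alpha * xi) / (1.0 + alpha)
```

Despite the very large first steps (here $$\alpha_1 = 2$$ against a curvature of 1), the iteration never blows up, and after K steps the iterate sits within the expected $$\mathcal{O}(1/\sqrt{K})$$ neighbourhood of $$w^* = 0$$.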

The main idea of the proof is to acquire a contraction property of the form

\begin{aligned} \mathbf {E}_{k-1}\left[ \Vert w^k - w^* \Vert ^2 \right] \le C_k \mathbf {E}_{k-2}\left[ \Vert w^{k-1} - w^* \Vert ^2 \right] + \alpha _k^2 D, \end{aligned}

where $$C_k < 1$$ and D are certain constants depending on the data. Inevitably, $$C_k \rightarrow 1$$ as $$k \rightarrow \infty$$, but because of the chosen step size sequence this happens slowly enough to still guarantee the optimal rate. To reach this point, we first show two things: First, an a priori bound of the form $$\mathbf {E}_{k-1}\left[ \Vert w^k - w^* \Vert ^2 \right] \le C$$, i.e. unlike the SGD, the SPI is always stable regardless of how large the step size is. Secondly, that the resolvents $$T_{f, \xi }$$ are contractive with

\begin{aligned} \mathbf {E}_{\xi } \left[\Vert T_{f, \xi } u - T_{f, \xi } v \Vert ^2\right] \le C_k\Vert u-v\Vert ^2. \end{aligned}

Similarly to , we do the latter by approximating the functions $$f(\cdot , \xi )$$ by convex quadratic functions $$\tilde{f}(\cdot , \xi )$$ for which the property is easier to verify, and then establishing a relation between the approximated and the true contraction factors. The series of lemmas in the next section is devoted to this preparatory work.
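A minimal numerical sketch of this contraction-in-expectation property, using an assumed family of quadratics in which only some realizations are strongly convex (mirroring the role of $$\mu_\xi$$ in Assumption 1):

```python
# Contraction-in-expectation sketch: let f(w, xi) = mu_xi * w**2 / 2 where
# mu_xi is 0 or 2 with equal probability. Each realization is merely convex
# (mu_xi = 0 gives a flat function), yet F(w) = E[f(w, xi)] = w**2 / 2 is
# strongly convex with mu = 1. The resolvents still contract in expectation.
alpha = 1.0
mus = [0.0, 2.0]           # the two equally likely realizations of mu_xi
u, v = 5.0, 1.0            # arbitrary test points

def resolvent(w, mu):
    # (I + alpha * grad f(., xi))^{-1} w = w / (1 + alpha * mu_xi)
    return w / (1.0 + alpha * mu)

# E_xi[ |T u - T v|^2 ] / |u - v|^2, i.e. the expected contraction factor.
factor = sum((resolvent(u, m) - resolvent(v, m)) ** 2 for m in mus) / len(mus)
factor /= (u - v) ** 2
```

Here the flat realization contributes a factor 1 and the strongly convex one a factor $$1/9$$, so the average is $$5/9 < 1$$: no single realization needs to be strictly contractive for the expectation to be.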

## Preliminaries

First, let us show that the scheme is in fact well-defined, in the sense that every iterate is measurable if the random variables $$\{\xi ^k\}_{k \in \mathbb {N}}$$ are.

### Lemma 1

Let Assumption 1 be fulfilled. Further, let $$\{\xi ^k\}_{k \in \mathbb {N}}$$ be a family of jointly independent random variables. Then for every $$k \in \mathbb {N}$$ there exists a unique mapping $$w^{k+1} :\Omega \rightarrow \mathcal {D}\left( \nabla f \right)$$ that fulfills (3) and is measurable with respect to the $$\sigma$$-algebra generated by $$\xi ^1, \ldots , \xi ^k$$.

### Proof

We define the mapping

\begin{aligned} h :\mathcal {D}\left( \nabla f \right) \times \Omega \rightarrow H, \quad (u, \omega ) \mapsto w^k - (I + \alpha _k \iota \nabla f(\cdot , \xi ^k(\omega ))) u. \end{aligned}

For almost all $$\omega \in \Omega$$, the mapping $$f(\cdot , \xi ^k(\omega ))$$ is lower semi-continuous, proper and convex. Thus, by [27, Theorem A] $$\nabla f(\cdot , \xi ^k(\omega ))$$ is maximal monotone. By [4, Theorem 2.2], this shows that the operator $$\iota ^{-1} + \alpha _k \nabla f(\cdot , \xi ^k(\omega )) :\mathcal {D}\left( \nabla f \right) \rightarrow H^*$$ is surjective. Note that the two previously cited results are stated for multi-valued operators. As we are in a more regular setting, the sub-differential of $$f(\cdot , \xi ^k(\omega ))$$ only consists of a single element at each point. Therefore, it is possible to apply these multi-valued results also in our setting and interpret the appearing operators as single-valued. Furthermore, due to the monotonicity of $$\nabla f(\cdot , \xi ^k(\omega ))$$ it follows that for $$u,v \in \mathcal {D}\left( \nabla f \right)$$

\begin{aligned} \langle \left(\iota ^{-1} + \alpha _k \nabla f(\cdot , \xi ^k(\omega ))\right) u - \left(\iota ^{-1} + \alpha _k \nabla f(\cdot , \xi ^k(\omega ))\right) v , u -v \rangle _{} \ge \Vert u-v\Vert ^2 \end{aligned}

which implies

\begin{aligned} \left\Vert \left(\iota ^{-1} + \alpha _k \nabla f(\cdot , \xi ^k(\omega ))\right) u - \left(\iota ^{-1} + \alpha _k \nabla f(\cdot , \xi ^k(\omega ))\right) v \right\Vert _{H^*} \ge \Vert u-v\Vert . \end{aligned}

This verifies that $$I + \alpha _k \iota \nabla f(\cdot , \xi ^k(\omega ))$$ is injective. As we have proved that the operator is both injective and surjective, it is, in particular, bijective. Therefore, there exists a unique element $$w^{k+1}(\omega )$$ such that

\begin{aligned} h(w^{k+1}(\omega ), \omega ) = w^k - (I + \alpha _k \iota \nabla f(\cdot , \xi ^k(\omega ))) w^{k+1}(\omega ) = 0. \end{aligned}

We can now apply [14, Lemma 2.1.4] or [15, Lemma  4.3] and obtain that $$\omega \mapsto w^{k+1}(\omega )$$ is measurable. $$\square$$

Proving that the scheme is always stable is relatively straightforward, as shown in the next lemma. With some extra effort, we also get stability in stronger norms, i.e. we can bound not only $$\mathbf {E}_{k}\left[\Vert w^{k+1} - w^*\Vert ^2\right]$$ but also higher moments $$\mathbf {E}_{k}\left[\Vert w^{k+1} - w^*\Vert ^{2^m}\right]$$, $$m \in \mathbb {N}$$. This will be important since we only have the weaker local Lipschitz continuity stated in Assumption 1 rather than global Lipschitz continuity. The idea of the proof stems from a standard technique mostly applied in the field of evolution equations in a variational framework, compare for example [31, Lemma 8.6]. The main difficulty is to incorporate the stochastic gradient in the presentation.

### Lemma 2

Let Assumption 1 be fulfilled, and suppose that $$\sum _{k=1}^{\infty }{\alpha _k^2} < \infty$$. Then there exists a constant $$D \ge 0$$ depending only on $$\Vert w_1 - w^*\Vert$$, $$\sum _{k=1}^{\infty }{\alpha _k^2}$$ and $$\sigma$$, such that

\begin{aligned} \mathbf {E}_{k}\left[\Vert w^{k+1} - w^*\Vert ^{2^m}\right] \le D \end{aligned}

for all $$k \in \mathbb {N}$$.

### Proof

Within the proof, we abbreviate the function $$f(\cdot , \xi ^k)$$ by $$f_k$$, $$k \in \mathbb {N}$$. First, we consider the case $$m = 1$$. Recall the identity $$( a - b , a )_{} = \frac{1}{2} \left(\Vert a\Vert ^2 - \Vert b\Vert ^2 + \Vert a-b\Vert ^2\right)$$, $$a,b \in H$$. We write the scheme as

\begin{aligned} w^{k+1} - w^k + \alpha _k \iota \nabla f_k(w^{k+1}) = 0, \end{aligned}

subtract $$\alpha _k \iota \nabla f_k(w^{*})$$ from both sides, multiply by two and test it with $$w^{k+1} - w^*$$ to obtain

\begin{aligned}&\Vert w^{k+1} - w^*\Vert ^2 - \Vert w^k - w^*\Vert ^2 + \Vert w^{k+1} -w^k\Vert ^2 \\&\qquad + 2 \alpha _k ( \iota \nabla f_k(w^{k+1}) - \iota \nabla f_k(w^*) , w^{k+1} - w^* )_{} \\&\quad = - 2 \alpha _k ( \iota \nabla f_k(w^*) , w^{k+1} - w^* )_{}. \end{aligned}

For the right-hand side, we have by Young’s inequality that

\begin{aligned}&- 2 \alpha _k ( \iota \nabla f_k(w^*) , w^{k+1} - w^* )_{} \\&\quad = - 2 \alpha _k \langle \nabla f_k(w^*) , w^{k+1} - w^k\rangle _{} - 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\\&\quad \le 2 \alpha _k \Vert \nabla f_k(w^*)\Vert _{H^*} \Vert w^{k+1} - w^k\Vert - 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\\&\quad \le \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2 + \Vert w^{k+1} - w^k\Vert ^2 - 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}. \end{aligned}

Together with the monotonicity condition, it then follows that

\begin{aligned} \begin{aligned} \Vert w^{k+1} - w^*\Vert ^2 - \Vert w^k - w^*\Vert ^2&\le \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2- 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}. \end{aligned} \end{aligned}
(4)

Since $$w^k - w^*$$ is independent of $$\xi ^k$$ and $$\mathbf {E}_{\xi ^k}[\nabla f_k(w^*)] = \nabla F(w^*) = 0$$, taking the expectation $$\mathbf {E}_{\xi ^k}$$ thus leads to the following bound:

\begin{aligned} \mathbf {E}_{\xi ^k}\left[\Vert w^{k+1} - w^*\Vert ^2\right] \le \Vert w^k - w^*\Vert ^2 + \alpha _k^2 \sigma ^2. \end{aligned}

Repeating this argument, we obtain that

\begin{aligned} \mathbf {E}_{k}\left[\Vert w^{k+1} - w^*\Vert ^2\right] \le \Vert w_1 - w^*\Vert ^2 + \sigma ^2 \sum _{j = 1}^{k}\alpha _{j}^2. \end{aligned}
(5)

In order to find the higher moment bound, we recall (4). We then follow a similar idea as in [10, Lemma 3.1], where we multiply this inequality with $$\Vert w^{k+1} - w^*\Vert ^2$$ and use the identity $$(a - b)a = \frac{1}{2} \left(|a |^2 - |b |^2 + |a-b |^2\right)$$ for $$a,b \in \mathbb {R}$$. It then follows that

\begin{aligned}&\Vert w^{k+1} - w^*\Vert ^4 - \Vert w^k - w^*\Vert ^4 + \left|\Vert w^{k+1} - w^*\Vert ^2 - \Vert w^k - w^*\Vert ^2 \right|^2\\&\quad \le \left( \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2- 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\right) \Vert w^{k+1} - w^*\Vert ^2\\&\quad \le \left( \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2- 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \right) \\&\qquad \times \left(\Vert w^k - w^*\Vert ^2 + \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2- 2 \alpha _k \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\right)\\&\quad \le \alpha _k^2 \Vert w^k - w^*\Vert ^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2- 2 \alpha _k \Vert w^k - w^*\Vert ^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \\&\qquad + \alpha _k^4 \Vert \nabla f_k(w^*)\Vert _{H^*}^4- 4 \alpha _k^3 \Vert \nabla f_k(w^*)\Vert _{H^*}^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\\&\qquad + 4 \alpha _k^2 \left(\langle \nabla f_k(w^*) , w^k - w^*\rangle _{}\right)^2. \end{aligned}

Applying Young’s inequality to the first and fourth term of the previous row then implies that

\begin{aligned}&\Vert w^{k+1} - w^*\Vert ^4 - \Vert w^k - w^*\Vert ^4\\&\quad \le \frac{\alpha _k^2}{2}\Vert w^k - w^*\Vert ^4 - 2 \alpha _k \Vert w^k - w^*\Vert ^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \\&\qquad + \left(3\alpha _k^4 + \frac{\alpha _k^2}{2}\right) \Vert \nabla f_k(w^*)\Vert _{H^*}^4 + 6 \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^2 \Vert w^k - w^*\Vert ^2\\&\quad \le \frac{\alpha _k^2}{2}\Vert w^k - w^*\Vert ^4 - 2 \alpha _k \Vert w^k - w^*\Vert ^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \\&\quad \quad + \left(3\alpha _k^4 + \frac{\alpha _k^2}{2}\right) \Vert \nabla f_k(w^*)\Vert _{H^*}^4 + 3 \alpha _k^2 \Vert \nabla f_k(w^*)\Vert _{H^*}^4 + 3 \alpha _k^2 \Vert w^k - w^*\Vert ^4\\&\quad \le \frac{7\alpha _k^2}{2}\Vert w^k - w^*\Vert ^4 - 2 \alpha _k \Vert w^k - w^*\Vert ^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \\&\quad \quad + \left(3\alpha _k^4 + \frac{7\alpha _k^2}{2}\right) \Vert \nabla f_k(w^*)\Vert _{H^*}^4. \end{aligned}

Summing up from $$j=1$$ to k and taking the expectation $$\mathbf {E}_{k}$$ yields

\begin{aligned}&\mathbf {E}_{k} \left[\Vert w^{k+1} - w^*\Vert ^4\right] \\&\quad \le \Vert w_1 - w^*\Vert ^4 + \sum _{j = 1}^{k} \frac{7\alpha _j^2}{2} \mathbf {E}_{j-1} \left[\Vert w^j - w^*\Vert ^4 \right] + \sigma ^4 \sum _{j=1}^{k} \left(3\alpha _j^4 + \frac{7\alpha _j^2}{2}\right). \end{aligned}

We then apply the discrete Grönwall inequality for sums (see, e.g., ) which shows that

\begin{aligned} \mathbf {E}_k \left[\Vert w^{k+1} - w^*\Vert ^4\right] \le \left(\Vert w_1 - w^*\Vert ^4 + \sigma ^4 \sum _{j=1}^{k} \left(3\alpha _j^4 + \frac{7\alpha _j^2}{2}\right)\right) \mathrm {exp}\left(\frac{7}{2}\sum _{j = 1}^{k} \alpha _j^2\right). \end{aligned}

For the next higher bound $$\mathbf {E}_k \left[\Vert w^{k+1} - w^*\Vert ^8\right]$$, we recall that

\begin{aligned}&\Vert w^{k+1} - w^*\Vert ^4 - \Vert w^k - w^*\Vert ^4\\&\quad \le \frac{7\alpha _k^2}{2}\Vert w^k - w^*\Vert ^4 - 2 \alpha _k \Vert w^k - w^*\Vert ^2 \langle \nabla f_k(w^*) , w^k - w^*\rangle _{} \\&\quad \quad + \left(3\alpha _k^4 + \frac{7\alpha _k^2}{2}\right) \Vert \nabla f_k(w^*)\Vert _{H^*}^4, \end{aligned}

which we can multiply by $$\Vert w^{k+1} - w^*\Vert ^4$$ in order to follow the same strategy as before. Proceeding in this way, we obtain bounds for $$\mathbf {E}_k \left[\Vert w^{k+1} - w^*\Vert ^{2^m}\right]$$ recursively for all $$m \in \mathbb {N}$$. $$\square$$

### Remark 3

In particular, Lemma 2 implies that there exists a constant D depending on $$\Vert w_1 - w^*\Vert$$, $$\sum _{k=1}^{\infty }{\alpha _k^2}$$ and $$\sigma$$ such that

\begin{aligned} \mathbf {E}_{k}\left[\Vert w^{k+1} - w^*\Vert ^{p}\right] \le D \end{aligned}

for all $$p \le 2^m$$ and $$k \in \mathbb {N}$$. Further, comparing (5)

\begin{aligned} \mathbf {E}_k{\left[ \Vert w^{k+1} - w^* \Vert _{}^2\right] } \le \Vert w_1 - w^* \Vert _{} + \sum _{i=1}^k \alpha _i^2 \mathbf {E}_{\xi ^i}{\left[ \Vert \nabla f(w^*, \xi ^i) \Vert _{}^2 \right] }, \end{aligned}

to the corresponding bound for the SGD

\begin{aligned} \mathbf {E}_k{\left[ \Vert w^{k+1} - w^* \Vert _{}^2\right] } \le \Vert w_1 - w^* \Vert _{} + \sum _{i=1}^k \alpha _i^2 \mathbf {E}_{i}{\left[ \Vert \nabla f(w^i, \xi ^i) \Vert _{}^2 \right] }, \end{aligned}

indicates that the SPI has a smaller a priori bound than the SGD. This bound plays a crucial part in the error constant in the convergence proof of Theorem 1. In practice, one would expect the terms $$\mathbf {E}_{\xi ^i}{\left[\Vert \nabla f(w^*,\xi ^i) \Vert _{}^2 \right]}$$ to be significantly smaller than $$\mathbf {E}_{i}{\left[\Vert \nabla f(w^i,\xi ^i) \Vert _{}^2 \right]}$$ if the variance of $$\nabla f(\cdot , \xi ^i)$$ is small. Note that since the estimate is unbiased and $$\nabla F(w^*) = 0$$, the variance at the minimum is given by $$\mathbf {E}_{\xi ^i}\left[{\Vert \nabla f(w^*,\xi ^i) \Vert _{}}^2\right] -\Vert \mathbf {E}_{\xi ^i}\left[{\nabla f(w^*, \xi ^i)}\right] \Vert _{}^2 = \mathbf {E}_{\xi ^i}\left[{\Vert \nabla f(w^*,\xi ^i) \Vert _{}}^2\right]$$.
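The effect described in the remark can be illustrated on a hypothetical one-dimensional toy model $$f(w,\xi ) = \frac{\lambda _{\xi }}{2}(w - m_{\xi })^2$$: the gradient-norm term evaluated at $$w^*$$ is smaller than the same term evaluated at points away from $$w^*$$, which is where the SGD iterates $$w^i$$ live early on. All distributions below are arbitrary choices:

```python
import random

random.seed(0)
# hypothetical 1-D toy model: f(w, xi) = 0.5 * lam * (w - m)^2 with random (lam, m)
pop = [(random.uniform(0.5, 2.0), random.gauss(0.0, 1.0)) for _ in range(100_000)]

# minimizer of F(w) = E[f(w, xi)] over the sample: w* = E[lam * m] / E[lam]
wstar = sum(l * m for l, m in pop) / sum(l for l, m in pop)

def grad_sq(w):
    """Sample average of ||grad f(w, xi)||^2 = (lam * (w - m))^2."""
    return sum((l * (w - m)) ** 2 for l, m in pop) / len(pop)

# the SPI bound term is evaluated at w*; the SGD term involves iterates away from w*
print(f"at w*:      {grad_sq(wstar):.3f}")
print(f"at w = 1.0: {grad_sq(1.0):.3f}")
print(f"at w = 5.0: {grad_sq(5.0):.3f}")
assert grad_sq(wstar) < grad_sq(1.0) < grad_sq(5.0)
```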

Following Ryu and Boyd, we now introduce the function $$\tilde{f}(\cdot ,\xi ) :H \times \Omega \rightarrow (-\infty ,\infty ]$$ given by

\begin{aligned} \tilde{f}(u,\xi )=f(u_0,\xi ) + \langle \nabla f(u_0,\xi ) , u - u_0\rangle _{}+ \frac{1}{2} ( M_{\xi }(u - u_0) , u - u_0 )_{}, \end{aligned}
(6)

where $$u_0 \in \mathcal {D}\left( \nabla f \right)$$ is a fixed parameter. This mapping is a convex approximation of f. Furthermore, we define the function $$\tilde{r}(\cdot ,\xi ) :H \times \Omega \rightarrow (-\infty ,\infty ]$$ given by

\begin{aligned} \tilde{r}(u,\xi ) = f(u,\xi ) - \tilde{f}(u,\xi ). \end{aligned}
(7)

Their gradients $$\nabla \tilde{f}(\cdot ,\xi ) :H \times \Omega \rightarrow H^*$$ and $$\nabla \tilde{r}(\cdot ,\xi ) :\mathcal {D}\left( \nabla f \right) \times \Omega \rightarrow H^*$$ can be stated as

\begin{aligned} \nabla \tilde{f}(u,\xi )&= \nabla f(u_0,\xi ) + ( M_{\xi }(u - u_0) , \cdot )_{}, \quad u \in H,\\ \nabla \tilde{r}(u,\xi )&= \nabla f(u,\xi ) - \nabla f(u_0,\xi ) - ( M_{\xi }(u - u_0) , \cdot )_{}, \quad u \in \mathcal {D}\left( \nabla f \right) \end{aligned}

almost surely. In the following lemma, we collect some standard properties of these operators.

### Lemma 3

The function $$\tilde{r}(\cdot , \xi )$$ defined in (7) is convex almost surely, i.e., it fulfills $$\tilde{r}(u,\xi ) \ge \tilde{r}(v,\xi ) + \langle \nabla \tilde{r}(v,\xi ) , u - v\rangle _{}$$ for all $$u,v \in \mathcal {D}\left( \nabla f \right)$$ almost surely. As a consequence, the gradient $$\nabla \tilde{r}(\cdot ,\xi )$$ is monotone almost surely.

### Proof

In the following proof, let us omit $$\xi$$ for simplicity and let $$u,v \in \mathcal {D}\left( \nabla f \right)$$ be given. Due to the monotonicity property of $$\nabla f$$ stated in Assumption 1, it follows that

\begin{aligned} f(u) \ge f(v) + \langle \nabla f(v) , u - v\rangle _{} + \frac{1}{2}( M(u - v) , u - v )_{}. \end{aligned}

For the function $$\tilde{f}$$ we can write

\begin{aligned} \tilde{f}(u)&= f(u_0) + \langle \nabla f(u_0) , u - u_0\rangle _{}+ \frac{1}{2} ( M(u - u_0) , u - u_0 )_{},\\ \nabla \tilde{f}(u)&= \nabla f(u_0) + ( M(u -u_0) , \cdot )_{} \quad \text {and} \quad \nabla ^2 \tilde{f}(u) = M. \end{aligned}

All further derivatives are zero. Thus, we can use a Taylor expansion around v to write

\begin{aligned} \tilde{f}(u)&= \tilde{f}(v) + \langle \nabla \tilde{f}(v) , u - v\rangle _{} + \frac{1}{2}( M(u - v) , u - v )_{}. \end{aligned}

It then follows that

\begin{aligned} \tilde{r}(u)&\ge f(v) + \langle \nabla f(v) , u - v\rangle _{} + \frac{1}{2}( M(u - v) , u - v )_{}\\&\quad - \left(\tilde{f}(v) + \langle \nabla \tilde{f}(v) , u - v\rangle _{} + \frac{1}{2}( M(u - v) , u - v )_{}\right)\\&= \tilde{r}(v) + \langle \nabla \tilde{r}(v) , u - v\rangle _{}. \end{aligned}

By [41, Proposition 25.10], it follows that $$\nabla \tilde{r}$$ is monotone. $$\square$$
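A one-dimensional illustration of Lemma 3 (outside the paper's abstract setting): for the arbitrary smooth example $$f(u) = \frac{1}{2}u^2 + \log (1 + \mathrm {e}^u)$$ one has $$f'' \ge 1$$, so the lower bound of Assumption 1 holds with $$M = 1$$, and the convexity inequality for $$\tilde{r} = f - \tilde{f}$$ can be checked on random pairs:

```python
import math, random

random.seed(1)
M, u0 = 1.0, 0.3   # illustrative choices: strong-convexity operator M and expansion point u0

f  = lambda u: 0.5 * u * u + math.log1p(math.exp(u))   # f''(u) = 1 + sigmoid'(u) >= M = 1
df = lambda u: u + 1.0 / (1.0 + math.exp(-u))

ft  = lambda u: f(u0) + df(u0) * (u - u0) + 0.5 * M * (u - u0) ** 2   # \tilde f from (6)
dft = lambda u: df(u0) + M * (u - u0)
rt  = lambda u: f(u) - ft(u)                                           # \tilde r from (7)
drt = lambda u: df(u) - dft(u)

for _ in range(1000):
    u, v = random.uniform(-5, 5), random.uniform(-5, 5)
    # convexity inequality of Lemma 3: r~(u) >= r~(v) + r~'(v) (u - v)
    assert rt(u) >= rt(v) + drt(v) * (u - v) - 1e-9
print("convexity inequality for r~ verified on 1000 random pairs")
```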

The following lemma demonstrates that the resolvents $$T_{\tilde{f}, \xi }$$ and certain perturbations of them are well-defined. Furthermore, we provide a more explicit formula for such resolvents. A comparable result is mentioned in [32, page 10]; we include a proof for the sake of completeness.

### Lemma 4

Let Assumption 1 be fulfilled and let $$\tilde{f}(\cdot , \xi )$$ be defined as in (6). Then the operator

\begin{aligned} T_{\tilde{f}, \xi } = (I + \iota \nabla \tilde{f}(\cdot ,\xi ))^{-1} :H \times \Omega \rightarrow H \end{aligned}

is well-defined. If a function $$r(\cdot , \xi ) :H \times \Omega \rightarrow (-\infty , \infty ]$$ is Gâteaux differentiable with the common domain $$\mathcal {D}\left( \nabla r \right) = \mathcal {D}\left( \nabla f \right)$$, lower semi-continuous, convex and proper almost surely, then

\begin{aligned} T_{\tilde{f}+ r, \xi } = (I + \iota \nabla \tilde{f}(\cdot ,\xi ) + \iota \nabla r(\cdot ,\xi ))^{-1} :H \times \Omega \rightarrow \mathcal {D}\left( \nabla f \right) \end{aligned}

is well-defined.

If there exist $$Q_{\xi } :\mathcal {D}\left( \nabla f \right) \times \Omega \rightarrow H^*$$ and $$z_{\xi } :\Omega \rightarrow H^*$$ such that $$\nabla r (u,\xi ) = Q_{\xi } u + z_{\xi }$$ then the resolvent can be represented by

\begin{aligned} T_{\tilde{f} + r, \xi } u = (I + M_{\xi } + \iota Q_{\xi })^{-1} \left(u - \iota \nabla f(u_0,\xi ) + M_{\xi }u_0 - \iota z_{\xi }\right). \end{aligned}

### Proof

For simplicity, let us omit $$\xi$$ again. In order to prove that $$T_{\tilde{f}}$$ and $$T_{\tilde{f} + r}$$ are well-defined, we can apply [27, Theorem A] and [4, Theorem 2.2] analogously to the argumentation in the proof of Lemma 1.

Assuming that $$\nabla r (u) = Q u + z$$, we find an explicit representation for $$T_{\tilde{f} + r}$$. To this end, for $$v \in H$$, consider

\begin{aligned} (I + \iota \nabla \tilde{f} + \iota \nabla r)^{-1} v = T_{\tilde{f} + r} v =: u \in \mathcal {D}\left( \nabla f \right) . \end{aligned}

Then it follows that

\begin{aligned} v = (I + \iota \nabla \tilde{f} + \iota \nabla r) u = (I + M + \iota Q) u + \iota \nabla f(u_0) - Mu_0 + \iota z. \end{aligned}

Rearranging the terms yields

\begin{aligned} T_{\tilde{f} + r} v&= (I + M + \iota Q)^{-1} \left(v - \iota \nabla f(u_0) + Mu_0 - \iota z\right). \end{aligned}

$$\square$$
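The representation in Lemma 4 can be verified in the finite-dimensional case $$H = \mathbb {R}^2$$ (so that $$\iota$$ is the identity) by solving the defining equation directly; all matrices and vectors below are arbitrary test data:

```python
# 2x2 helpers in plain Python (no external dependencies)
def matvec(A, x): return [A[0][0]*x[0] + A[0][1]*x[1], A[1][0]*x[0] + A[1][1]*x[1]]
def add(x, y):    return [x[0] + y[0], x[1] + y[1]]
def sub(x, y):    return [x[0] - y[0], x[1] - y[1]]
def solve2(A, b):
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    return [(A[1][1]*b[0] - A[0][1]*b[1]) / det, (-A[1][0]*b[0] + A[0][0]*b[1]) / det]

# arbitrary test data: symmetric positive M and Q, a stand-in for grad f(u0), a shift z
M  = [[2.0, 0.5], [0.5, 1.0]]
Q  = [[1.0, 0.2], [0.2, 3.0]]
u0 = [0.3, -0.7]
gf_u0 = [1.5, -0.4]
z  = [0.1, 0.2]
v  = [2.0, 1.0]

# resolvent via the representation in Lemma 4:
# u = (I + M + Q)^{-1} (v - grad f(u0) + M u0 - z)
A = [[1 + M[0][0] + Q[0][0], M[0][1] + Q[0][1]],
     [M[1][0] + Q[1][0], 1 + M[1][1] + Q[1][1]]]
rhs = sub(add(sub(v, gf_u0), matvec(M, u0)), z)
u = solve2(A, rhs)

# check the defining equation u + grad ~f(u) + grad r(u) = v, where
# grad ~f(u) = grad f(u0) + M(u - u0) and grad r(u) = Q u + z
lhs = add(add(u, add(gf_u0, matvec(M, sub(u, u0)))), add(matvec(Q, u), z))
assert all(abs(l - t) < 1e-9 for l, t in zip(lhs, v))
print("resolvent representation verified:", [round(c, 6) for c in u])
```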

Next, we will show that the contraction factors of $$T_{f, \xi }$$ and $$T_{\tilde{f}, \xi }$$ are related. For this, we need the following basic identities and some stronger inequalities that hold for symmetric positive operators on H. These results are fairly standard and similar statements can be found in [32, Lemma 9 and Lemma 10]. For the sake of completeness, we provide an alternative proof that is better adapted to our notation.

### Lemma 5

Let Assumption 1 be satisfied and let $$\tilde{f}(\cdot , \xi )$$ and $$\tilde{r}(\cdot , \xi )$$ be given as in (6) and (7), respectively. Then the identities

\begin{aligned} \iota \nabla f (T_{f, \xi }, \xi ) = I - T_{f,\xi } \quad \text {and} \quad \iota \nabla \tilde{f} (T_{f,\xi },\xi ) + T_{f,\xi } - I = - \iota \nabla \tilde{r} (T_{f,\xi },\xi ) \end{aligned}

are fulfilled almost surely.

### Proof

By the definition of $$T_{f,\xi }$$, we have that

\begin{aligned} T_{f,\xi } + \iota \nabla f (T_{f,\xi },\xi ) = (I + \iota \nabla f(\cdot ,\xi ) ) T_{f,\xi } = I, \end{aligned}

from which the first claim follows immediately. The second identity then follows from

\begin{aligned} \iota \nabla \tilde{f} (T_{f,\xi },\xi ) + T_{f,\xi } - I = \iota \nabla \tilde{f} (T_{f,\xi },\xi ) - \iota \nabla f (T_{f,\xi },\xi ) = - \iota \nabla \tilde{r} (T_{f,\xi },\xi ) . \end{aligned}

$$\square$$

As a consequence of Lemma 5 we have the following basic inequalities:

### Lemma 6

Let Assumption 1 be satisfied. It then follows that

\begin{aligned} \Vert T_{f,\xi } u - u\Vert \le \Vert \nabla f(u,\xi )\Vert _{H^*} \end{aligned}

almost surely for every $$u \in \mathcal {D}\left( \nabla f \right)$$. Additionally, if for $$R >0$$ the bound $$\Vert u\Vert + \Vert \nabla f(u,\xi )\Vert \le R$$ holds true almost surely, then

\begin{aligned} \Vert \iota ^{-1}(T_{f,\xi } u - u) + \nabla f(u,\xi )\Vert _{H^*} \le L_{\xi }(R)\Vert \nabla f(u,\xi )\Vert _{H^*} \end{aligned}

is fulfilled almost surely.

### Proof

In order to shorten the notation, we omit the $$\xi$$ in the following proof and let u be in $$\mathcal {D}\left( \nabla f \right)$$. For the first inequality, we note that since $$\nabla f$$ is monotone, we have

\begin{aligned} \langle \nabla f (T_f u) - \nabla f(u) , T_f u - u\rangle _{} \ge 0. \end{aligned}

Thus, by the first identity in Lemma 5, it follows that

\begin{aligned} \langle -\nabla f(u) , T_f u - u\rangle _{}&= \langle \nabla f(T_fu) - \nabla f(u) , T_f u - u\rangle _{} - \langle \nabla f (T_f u ) , T_f u - u\rangle _{}\\&\ge \langle \iota ^{-1}(T_f u - u) , T_f u - u\rangle _{}\\&= ( T_f u - u , T_f u - u )_{} = \Vert T_f u - u \Vert ^2. \end{aligned}

But by the Cauchy-Schwarz inequality, we also have

\begin{aligned} \langle -\nabla f(u) , T_f u - u\rangle _{} \le \Vert \nabla f(u)\Vert _{H^*} \Vert T_f u - u \Vert , \end{aligned}

which in combination with the previous inequality proves the first claim.

The second inequality follows from the first part of this lemma. Because

\begin{aligned} \Vert T_f u\Vert \le \Vert T_f u - u\Vert + \Vert u\Vert \le \Vert \nabla f(u) \Vert _{H^*} + \Vert u\Vert , \end{aligned}

both u and $$T_f u$$ are in a ball of radius R. Thus, we obtain

\begin{aligned} \Vert \iota ^{-1} (T_f u - u) + \nabla f(u)\Vert _{H^*}&= \Vert \nabla f(u) - \nabla f(T_f u)\Vert _{H^*} \\&\le L_{}(R) \Vert u - T_f u\Vert \le L_{}(R) \Vert \nabla f(u)\Vert _{H^*}. \end{aligned}

$$\square$$
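The first inequality of Lemma 6 can be illustrated numerically in one dimension; here $$f(u) = \frac{1}{2}u^2 + \log \cosh u$$ is an arbitrary smooth convex example, and the resolvent equation $$x + f'(x) = u$$ is solved by bisection:

```python
import math

df = lambda u: u + math.tanh(u)   # gradient of the convex f(u) = u^2/2 + log cosh u

def resolvent(u, lo=-50.0, hi=50.0):
    """Solve x + df(x) = u by bisection; x + df(x) is strictly increasing."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid + df(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for u in [-3.0, -0.5, 0.0, 0.7, 2.0, 10.0]:
    Tu = resolvent(u)
    # first inequality of Lemma 6: ||T_f u - u|| <= ||grad f(u)||
    assert abs(Tu - u) <= abs(df(u)) + 1e-9
print("Lemma 6 bound verified at sample points")
```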

### Lemma 7

Let $$Q, S \in \mathcal {L}(H)$$ be symmetric operators. Then the following holds:

• If Q is invertible and S and $$Q^{-1}$$ are strictly positive, then $$(Q + S)^{-1} < Q^{-1}$$. If S is only positive, then $$(Q + S)^{-1} \le Q^{-1}$$.

• If Q is a positive and contractive operator, i.e. $$\Vert Qu\Vert \le \Vert u\Vert$$ for all $$u \in H$$, then it follows that $$\Vert Qu\Vert ^2 \le ( Qu , u )_{}$$ for all $$u \in H$$.

• If Q is a strongly positive invertible operator, such that there exists $$\beta > 0$$ with $$( Q u , u )_{} \ge \beta \Vert u\Vert ^2$$ for all $$u \in H$$, then $$\Vert Q u \Vert \ge \beta \Vert u\Vert$$ for all $$u \in H$$ and $$\Vert Q^{-1}\Vert _{\mathcal {L}(H)} \le \frac{1}{\beta }$$.

### Proof

We start by expressing $$(Q + S)^{-1}$$ in terms of $$Q^{-1}$$ and S, similarly to the Sherman-Morrison-Woodbury formula for matrices. First, observe that $$(I + Q^{-1}S)^{-1} \in \mathcal {L}(H)$$ by, e.g., [19, Lemma 2A.1]. Then, since

\begin{aligned}&\left(Q^{-1} - Q^{-1}S\left(I + Q^{-1}S \right)^{-1}Q^{-1}\right) (Q+S) \\&\quad = I + Q^{-1}S - Q^{-1}S\left(I + Q^{-1}S \right)^{-1}\left(I + Q^{-1}S \right) = I \end{aligned}

and

\begin{aligned}&(Q+S)\left(Q^{-1} - Q^{-1}S\left(I + Q^{-1}S \right)^{-1}Q^{-1}\right) \\&\quad = I + SQ^{-1} - S\left(I + Q^{-1}S \right)\left(I + Q^{-1}S \right)^{-1}Q^{-1} = I, \end{aligned}

we find that

\begin{aligned} (Q + S)^{-1} = Q^{-1} - Q^{-1}S\left(I + Q^{-1}S \right)^{-1}Q^{-1}. \end{aligned}

Since $$Q^{-1}$$ is symmetric, we see that $$(Q + S)^{-1} < Q^{-1}$$ if and only if $$S\left(I + Q^{-1}S \right)^{-1}$$ is strictly positive. This is indeed the case, as the change of variables $$z = (I + Q^{-1}S)^{-1} u$$ shows: we then have

\begin{aligned} \left(S\left(I + Q^{-1}S \right)^{-1} u,u\right)_{}&= \left(Sz,z + Q^{-1}Sz\right)_{} = \left(Sz,z\right)_{} + \left(Q^{-1}Sz,Sz\right)_{} > 0 \end{aligned}

for any $$u \in H$$, $$u \ne 0$$, since S and $$Q^{-1}$$ are strictly positive. If S is only positive, it follows analogously that $$\left(S\left(I + Q^{-1}S \right)^{-1} u,u\right)_{} \ge 0$$.

In order to prove the second statement, we use the fact that there exists a unique symmetric and positive square root $$Q^{\nicefrac {1}{2}} \in \mathcal {L}(H)$$ such that $$Q = Q^{\nicefrac {1}{2}}Q^{\nicefrac {1}{2}}$$. Since $$\Vert Q\Vert = \sup _{\Vert x\Vert = 1} ( Q x , x )_{} = \sup _{\Vert x\Vert = 1} ( Q^{\frac{1}{2}} x , Q^{\frac{1}{2}} x )_{} = \Vert Q^{\nicefrac {1}{2}}\Vert ^2$$, the operator $$Q^{\nicefrac {1}{2}}$$ is also contractive. Thus, it follows that

\begin{aligned} \Vert Q u\Vert ^2 = \Vert Q^{\nicefrac {1}{2}}Q^{\nicefrac {1}{2}}u\Vert ^2 \le \Vert Q^{\nicefrac {1}{2}}u\Vert ^2 = ( Q^{\nicefrac {1}{2}}u , Q^{\nicefrac {1}{2}}u )_{} = ( Qu , u )_{}. \end{aligned}

Now, we prove the third statement. First, we notice that $$( Qu , u )_{} \ge \beta \Vert u\Vert ^2$$ and $$( Qu , u )_{} \le \Vert Qu\Vert \Vert u\Vert$$ imply that $$\Vert Qu\Vert \ge \beta \Vert u\Vert$$ for all $$u \in H$$. Substituting $$u = Q^{-1}v$$ then shows $$\Vert v\Vert \ge \beta \Vert Q^{-1}v \Vert$$, which proves the final claim. $$\square$$
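In finite dimensions, both the Woodbury-style identity and the ordering statement of Lemma 7 can be checked directly; a self-contained $$2\times 2$$ sketch with an arbitrary choice of symmetric, strictly positive Q and S:

```python
def mm(A, B):   # 2x2 matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
def madd(A, B): return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
def inv(A):
    d = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    return [[A[1][1]/d, -A[0][1]/d], [-A[1][0]/d, A[0][0]/d]]
def quad(A, u):  # quadratic form (Au, u)
    return sum(sum(A[i][j]*u[j] for j in range(2)) * u[i] for i in range(2))

I = [[1.0, 0.0], [0.0, 1.0]]
Q = [[3.0, 1.0], [1.0, 2.0]]   # symmetric, strictly positive (arbitrary)
S = [[1.0, 0.5], [0.5, 1.0]]   # symmetric, strictly positive (arbitrary)

# Woodbury-style identity: (Q+S)^{-1} = Q^{-1} - Q^{-1} S (I + Q^{-1}S)^{-1} Q^{-1}
lhs = inv(madd(Q, S))
Qi = inv(Q)
corr = mm(mm(Qi, mm(S, inv(madd(I, mm(Qi, S))))), Qi)
rhs = [[Qi[i][j] - corr[i][j] for j in range(2)] for i in range(2)]
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-9 for i in range(2) for j in range(2))

# ordering (Q+S)^{-1} < Q^{-1}, tested in quadratic form on sample vectors
for u in [[1.0, 0.0], [0.3, -2.0], [1.0, 1.0]]:
    assert quad(lhs, u) < quad(Qi, u)
print("Woodbury identity and ordering verified")
```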

The previous lemma now allows us to extend [32, Theorem 10], which we have reformulated and restructured to match our setting. It relates the contraction factors of the true and approximated operators.

### Lemma 8

Let Assumption 1 be fulfilled and let $$\tilde{f}(\cdot , \xi )$$ be given as in (6). Then

\begin{aligned} \mathbf {E}_{\xi } \left[\frac{\Vert T_{f, \xi } u - T_{f, \xi } v \Vert ^2}{\Vert u-v\Vert ^2} \right] \le \left( \mathbf {E}_{\xi } \left[\frac{\Vert T_{\tilde{f}, \xi } u - T_{\tilde{f}, \xi } v \Vert ^2}{\Vert u-v\Vert ^2} \right]\right)^{\nicefrac {1}{2}} \end{aligned}

holds for every $$u,v \in H$$.

### Proof

For better readability, we once again omit $$\xi$$ where there is no risk of confusion. For $$u,v \in \mathcal {D}\left( \nabla f \right)$$ with $$u \ne v$$ and $$\varepsilon >0$$, we approximate the function $$\tilde{r}(\cdot , \xi )$$ defined in (7) by

\begin{aligned} \tilde{r}_{\varepsilon }(\cdot , \xi ) :H \times \Omega \rightarrow (-\infty ,\infty ], \quad \tilde{r}_{\varepsilon } (z,\xi ) = \langle \nabla \tilde{r}(T_f u,\xi ) , z\rangle _{} + \frac{\left(\langle v_{\varepsilon } , z - T_f u\rangle _{}\right)^2}{2 a_{\varepsilon }}, \end{aligned}

where

\begin{aligned} v_{\varepsilon } = - \nabla \tilde{r}(T_f u) + \nabla \tilde{r}(T_f v) + \varepsilon \iota ^{-1} (T_f v - T_f u) \in H \quad \text {and} \quad a_{\varepsilon } = \langle v_{\varepsilon } , T_f v - T_f u\rangle _{}. \end{aligned}

As we can write

\begin{aligned} a_{\varepsilon }&= \langle - \nabla \tilde{r}(T_f u) + \nabla \tilde{r}(T_f v) + \varepsilon \iota ^{-1} (T_f v - T_f u) , T_f v-T_f u\rangle _{}\\&= \langle \nabla \tilde{r}(T_f u) - \nabla \tilde{r}(T_f v) , T_f u-T_f v\rangle _{} + \varepsilon ( T_f v - T_f u , T_f v-T_f u )_{}\\&\ge \varepsilon \Vert T_f v - T_f u \Vert ^2 > 0, \end{aligned}

$$\tilde{r}_{\varepsilon }$$ is well-defined. The derivative is given by $$\nabla \tilde{r}_{\varepsilon }(\cdot ,\xi ) :H \times \Omega \rightarrow H^*$$,

\begin{aligned} \nabla \tilde{r}_{\varepsilon } (z) = \nabla \tilde{r}(T_f u) + \frac{\langle v_{\varepsilon } , z - T_f u\rangle _{}}{a_{\varepsilon }} v_{\varepsilon } = \frac{\langle v_{\varepsilon } , z\rangle _{}}{a_{\varepsilon }} v_{\varepsilon } + \nabla \tilde{r}(T_f u) - \frac{\langle v_{\varepsilon } , T_f u\rangle _{}}{a_{\varepsilon }} v_{\varepsilon }. \end{aligned}

This function $$\nabla \tilde{r}_{\varepsilon }$$ is an interpolation between the points

\begin{aligned} \nabla \tilde{r}_{\varepsilon } (T_f u)&= \nabla \tilde{r}(T_f u) \quad \text {and}\\ \nabla \tilde{r}_{\varepsilon } (T_f v)&= \nabla \tilde{r}(T_f u) + \frac{\langle v_{\varepsilon } , T_f v - T_f u\rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\\&= \nabla \tilde{r}(T_f u) + \frac{\langle v_{\varepsilon } , T_f v - T_f u\rangle _{}}{\langle v_{\varepsilon } , T_f v - T_f u\rangle _{}} v_{\varepsilon }\\&= \nabla \tilde{r}(T_f u) - \nabla \tilde{r}(T_f u) + \nabla \tilde{r}(T_f v) + \varepsilon \iota ^{-1} (T_f v - T_f u)\\&= \nabla \tilde{r}(T_f v) + \varepsilon \iota ^{-1} (T_f v - T_f u). \end{aligned}

Furthermore, since $$T_{\tilde{f} + \tilde{r}_{\varepsilon }} = (I + \iota \nabla \tilde{f} + \iota \nabla \tilde{r}_{\varepsilon })^{-1}$$, it follows that

\begin{aligned} (I + \iota \nabla \tilde{f} + \iota \nabla \tilde{r}_{\varepsilon } ) T_f u&= T_f u + \iota \nabla \tilde{f}(T_f u) + \iota \nabla \tilde{r} (T_f u)\\&= T_f u + \iota \nabla f(T_f u) = (I + \iota \nabla f) T_f u = u, \end{aligned}

and therefore

\begin{aligned} T_f u = (I + \iota \nabla \tilde{f} + \iota \nabla \tilde{r}_{\varepsilon } )^{-1} u = T_{\tilde{f} + \tilde{r}_{\varepsilon }} u. \end{aligned}

Applying Lemma 5, we find that

\begin{aligned}&(I + \iota \nabla \tilde{f} + \iota \nabla \tilde{r}_{\varepsilon } ) T_f v\\&\quad = T_f v + \iota \nabla \tilde{f}(T_f v) + \iota \nabla \tilde{r} (T_f v) + \varepsilon (T_f v - T_f u)\\&\quad = T_f v + \iota \nabla f(T_f v) + \varepsilon (T_f v - T_f u)= v + \varepsilon (T_f v - T_f u). \end{aligned}

This shows that

\begin{aligned} T_f v = (I + \iota \nabla \tilde{f} + \iota \nabla \tilde{r}_{\varepsilon } )^{-1}(v + \varepsilon (T_f v - T_f u)) = T_{\tilde{f} + \tilde{r}_{\varepsilon }} (v + \varepsilon (T_f v - T_f u)). \end{aligned}
(8)

Using the explicit representation of $$T_{\tilde{f} + \tilde{r}_{\varepsilon }}$$ from Lemma 4, it follows that

\begin{aligned} T_{\tilde{f} + \tilde{r}_{\varepsilon }} z= \left(I + M + \iota \left(\frac{\langle v_{\varepsilon } , \cdot \rangle }{a_{\varepsilon }} v_{\varepsilon }\right) \right)^{-1} \left(z - \iota \nabla f(u_0) + Mu_0 - \iota \left(\nabla \tilde{r}(T_f u) - \frac{\langle v_{\varepsilon } , T_f u\rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\right) \right). \end{aligned}

Therefore, we have

\begin{aligned}&\Vert T_{\tilde{f} + \tilde{r}_{\varepsilon }} v - T_{\tilde{f} + \tilde{r}_{\varepsilon }} (v + \varepsilon (T_f v - T_f u))\Vert \\&\quad \le \left\Vert \left(I + M + \iota \left(\frac{\langle v_{\varepsilon } , \cdot \rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\right) \right)^{-1} \right\Vert _{\mathcal {L}(H)} \Vert v - v - \varepsilon (T_f v - T_f u) \Vert \\&\quad \le \varepsilon \Vert T_f v - T_f u \Vert \rightarrow 0 \quad \text { as } \varepsilon \rightarrow 0, \end{aligned}

since

\begin{aligned} \left(\left(I + M + \iota \left(\frac{\langle v_{\varepsilon } , \cdot \rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\right)\right) u,u\right)_{} \ge \Vert u\Vert ^2 \end{aligned}

means that we can apply Lemma 7. Altogether, this shows that $$T_f u = T_{\tilde{f} + \tilde{r}_{\varepsilon }} u$$ and $$T_f v = \lim _{\varepsilon \rightarrow 0} T_{\tilde{f} + \tilde{r}_{\varepsilon }} v$$. Further, Lemma 4 provides the explicit representation

\begin{aligned} T_{\tilde{f}} z = (I + \iota \nabla \tilde{f} )^{-1} z = (I + M)^{-1} \left(z - \iota \nabla f(u_0) + Mu_0\right). \end{aligned}

For $$n = \frac{u-v}{\Vert u-v\Vert }$$, which satisfies $$\Vert n\Vert = 1$$, we obtain by Lemma 7 that

\begin{aligned} \frac{\Vert T_{\tilde{f}} u - T_{\tilde{f}} v \Vert }{\Vert u - v \Vert }&= \Vert (I + M)^{-1}n \Vert \\&\ge ( (I + M)^{-1} n , n )_{} \\&\ge \left(\left(I + M + \iota \left(\frac{\langle v_{\varepsilon } , \cdot \rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\right) \right)^{-1} n,n\right)_{} \\&\ge \left\Vert \left(I + M + \iota \left(\frac{\langle v_{\varepsilon } , \cdot \rangle _{}}{a_{\varepsilon }} v_{\varepsilon }\right) \right)^{-1} n \right\Vert ^2 \\&= \frac{\Vert T_{\tilde{f} + \tilde{r}_{\varepsilon } } u - T_{\tilde{f} + \tilde{r}_{\varepsilon }} v \Vert ^2}{\Vert u - v \Vert ^2} \rightarrow \frac{\Vert T_f u - T_f v \Vert ^2}{\Vert u - v \Vert ^2} \quad \text { as } \varepsilon \rightarrow 0. \end{aligned}

Finally, as $$\mathbf {E}_{\xi } \left[ \frac{\Vert T_{\tilde{f}} u - T_{\tilde{f}} v \Vert }{\Vert u - v \Vert } \right]$$ is finite, we can apply the dominated convergence theorem to obtain that

\begin{aligned} \mathbf {E}_{\xi } \left[ \frac{\Vert T_f u - T_f v \Vert ^2}{\Vert u - v \Vert ^2} \right] \le \mathbf {E}_{\xi } \left[ \frac{\Vert T_{\tilde{f}} u - T_{\tilde{f}} v \Vert }{\Vert u - v \Vert } \right] \le \left(\mathbf {E}_{\xi }\left[ \frac{\Vert T_{\tilde{f}} u - T_{\tilde{f}} v \Vert ^2}{\Vert u - v \Vert ^2} \right]\right)^{\frac{1}{2}}. \end{aligned}

$$\square$$
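In the quadratic case $$f(u,\xi ) = \frac{\mu _{\xi }}{2} u^2$$ on $$H = \mathbb {R}$$, the resolvents $$T_{f,\xi }$$ and $$T_{\tilde{f},\xi }$$ coincide and both contraction ratios equal $$(1+\mu _{\xi })^{-1}$$, so Lemma 8 reduces to $$\mathbf {E}[c^2] \le (\mathbf {E}[c^2])^{\nicefrac {1}{2}}$$ for $$c \le 1$$. A small numerical illustration of this degenerate case (the distribution of $$\mu _{\xi }$$ is an arbitrary choice):

```python
import random

random.seed(2)
# 1-D quadratic case: T_{f,xi} = T_{~f,xi} = (1 + mu_xi)^{-1} * id
mus = [random.uniform(0.0, 4.0) for _ in range(10_000)]
c2 = [1.0 / (1.0 + m) ** 2 for m in mus]   # squared contraction ratio, always <= 1

mean_c2 = sum(c2) / len(c2)
# Lemma 8 in this case: E[c^2] <= (E[c^2])^{1/2}, which holds since c^2 <= 1
assert mean_c2 <= mean_c2 ** 0.5
print(f"E[c^2] = {mean_c2:.4f} <= sqrt(E[c^2]) = {mean_c2 ** 0.5:.4f}")
```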

After having established a connection between the contraction properties of $$T_{f,\xi }$$ and $$T_{\tilde{f},\xi }$$, the next step is to provide a concrete result for the contraction factor of $$T_{\tilde{f},\xi }$$. Applying Lemma 4, we can express this resolvent in terms of $$M_{\xi }$$, which is easier to handle due to its linearity. The following lemma extends [32, Theorem 11]. As we are in an infinite-dimensional setting, we can no longer argue using the smallest eigenvalue of an operator. Instead, the proof uses the convexity parameters directly. Moreover, we provide an explicit, non-asymptotic bound for the contraction constant.

### Lemma 9

Let Assumption 1 be satisfied and let $$\tilde{f}(\cdot ,\xi )$$ be given as in (6). Then for $$u, v \in H$$ and $$\alpha > 0$$,

\begin{aligned} \mathbf {E}_{\xi }\left[ \Vert T_{\alpha \tilde{f}, \xi } u - T_{\alpha \tilde{f}, \xi } v \Vert ^2\right] < \mathbf {E}_{\xi }\left[ \Vert (I + \alpha M_{\xi })^{-1} \Vert _{\mathcal {L}(H)}^2 \right] \Vert u- v\Vert ^2 \end{aligned}

is fulfilled. Furthermore, it follows that

\begin{aligned} \mathbf {E}_{\xi }\left[ \Vert (I + \alpha M_{\xi })^{-1} \Vert _{\mathcal {L}(H)}^2 \right] < 1 - 2\mu \alpha + 3\nu ^2 \alpha ^2. \end{aligned}

### Proof

Due to the explicit representation of $$T_{\alpha \tilde{f}, \xi }$$ stated in Lemma 4, we find that

\begin{aligned} T_{\alpha \tilde{f}, \xi } u - T_{\alpha \tilde{f}, \xi } v = (I + \alpha M_{\xi })^{-1}(u-v) \end{aligned}

for $$u,v \in H$$. As $$u-v$$ does not depend on $$\Omega$$, it follows that

\begin{aligned} \mathbf {E}_{\xi }\left[ \Vert (I + \alpha M_{\xi })^{-1} (u - v) \Vert ^2\right] \le \mathbf {E}_{\xi }\left[ \Vert (I + \alpha M_{\xi })^{-1} \Vert _{\mathcal {L}(H)}^2 \right] \Vert u - v \Vert ^2. \end{aligned}

Thus, we have reduced the problem to a question about “how contractive” the resolvent of $$M_{\xi }$$ is in expectation. We note that for any $$u \in H$$, we have

\begin{aligned} ( (I + \alpha M_{\xi })u , u )_{} \ge (1 + \mu _{\xi } \alpha ) \Vert u\Vert ^2. \end{aligned}

Due to Lemma 7 it follows that

\begin{aligned} \Vert (I + \alpha M_{\xi })^{-1}\Vert _{\mathcal {L}(H)}^2 \le (1 + \mu _{\xi } \alpha )^{-2} . \end{aligned}

The right-hand side is a $$C^2$$-function of $$\alpha$$ on $$(-\frac{1}{\mu _{\xi }},\infty )$$, or even on all of $$\mathbb {R}$$ if $$\mu _{\xi } = 0$$. A second-order Taylor expansion around $$\alpha = 0$$, with the Lagrange remainder bounded for $$\alpha \ge 0$$, therefore shows that

\begin{aligned} \Vert (I + \alpha M_{\xi })^{-1}\Vert _{\mathcal {L}(H)}^2 \le 1 - 2 \mu _{\xi } \alpha + 3 \mu _{\xi }^2 \alpha ^2. \end{aligned}

Combining these results, we obtain

\begin{aligned} \mathbf {E}_{\xi } \left[\Vert (I + \alpha M_{\xi })^{-1} \Vert _{\mathcal {L}(H)}^2\right] \le \mathbf {E}_{\xi } \left[ 1 - 2 \mu _{\xi } \alpha + 3 \mu _{\xi }^2 \alpha ^2 \right] = 1 - 2\mu \alpha + 3\nu ^2 \alpha ^2. \end{aligned}

$$\square$$
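The scalar inequality underlying the Taylor step, $$(1+x)^{-2} \le 1 - 2x + 3x^2$$ for $$x \ge 0$$, is equivalent to $$4x^3 + 3x^4 \ge 0$$ after multiplying out, and is easy to check on a grid (the grid itself is arbitrary):

```python
# scalar inequality behind Lemma 9: (1 + mu*a)^{-2} <= 1 - 2*mu*a + 3*mu^2*a^2, mu, a >= 0
for mu in [0.0, 0.1, 0.5, 1.0, 3.0]:
    a = 0.0
    while a <= 5.0:
        assert (1.0 + mu * a) ** -2 <= 1.0 - 2.0 * mu * a + 3.0 * (mu * a) ** 2 + 1e-12
        a += 0.01
print("Taylor-type bound (1+x)^{-2} <= 1 - 2x + 3x^2 verified on a grid")
```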

Finally, the proof of the main theorem relies on iterating the step-wise bounds arising from the contraction properties of the resolvents which we just established. This leads to certain products of the contraction factors. The following algebraic inequalities show that these are bounded in the desired way. While this type of result has been stated previously for first-order polynomials in 1/j (see e.g. [24, Theorem 14]), we prove here a particular version for second-order polynomials that matches the approximation of the contraction factor stated in Lemma 9.

### Lemma 10

Let $$C_1, C_2>0$$, $$p>0$$ and $$r \ge 0$$ satisfy $$C_1p > r$$ and $$4C_2 \ge C_1^2$$. Then the following inequalities are satisfied:

1. (i)

$$\prod _{j= 1}^k \left(1-\frac{C_1}{j} + \frac{C_2}{j^2}\right)^{p} \le \mathrm {exp}\left(\frac{C_2 p \pi ^2}{6}\right) (k+1)^{-C_1p}$$,

2. (ii)

$$\sum _{j=1}^k \frac{1}{ j^{1+r}} \prod _{i = j+1}^k \left(1-\frac{C_1}{i} + \frac{C_2}{i^2}\right)^{p} \le 2^{C_1p} \mathrm {exp}\left(\frac{C_2 p \pi ^2}{6}\right) \frac{1}{C_1 p-r} (k+1)^{-r}.$$

### Proof

The proof relies on the elementary inequality $$1 + u \le \mathrm {e}^{u}$$, valid for all $$u \in \mathbb {R}$$, and the following two basic inequalities involving (generalized) harmonic numbers

\begin{aligned} \ln {(k+1)} - \ln {(m)} \le \sum _{i=m}^k \frac{1}{i} \quad \text {and} \quad \sum _{i=1}^k{i^{C-1}} \le \frac{1}{C} (k+1)^C . \end{aligned}

The first one follows by observing that, since $$u \mapsto u^{-1}$$ is decreasing, the sum is an upper Riemann sum for the integral $$\int _{m}^{k+1} {u^{-1} \,\mathrm {d}u}$$. The second one can be proved analogously: the sum is bounded by the integral $$\int _0^{k} {u^{C-1}\,\mathrm {d}u}$$ if $$C<1$$ and by $$\int _1^{k+1} {u^{C-1}\,\mathrm {d}u}$$ if $$C>1$$, and both integrals are at most $$\frac{1}{C} (k+1)^C$$.
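Both harmonic-number inequalities are easily checked numerically; a brief sketch over a few arbitrary parameter values:

```python
import math

# ln(k+1) - ln(m) <= sum_{i=m}^{k} 1/i   and   sum_{i=1}^{k} i^{C-1} <= (k+1)^C / C
for k in [5, 50, 500]:
    for m in [1, 2, 10]:
        if m <= k:
            s = sum(1.0 / i for i in range(m, k + 1))
            assert math.log(k + 1) - math.log(m) <= s + 1e-12
    for C in [0.5, 1.5, 3.0]:
        s = sum(i ** (C - 1) for i in range(1, k + 1))
        assert s <= (k + 1) ** C / C + 1e-9
print("harmonic-number inequalities verified")
```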

The condition $$4C_2 \ge C_1^2$$ implies that all the factors in the product (i) are positive. We therefore have that $$0\le 1-\frac{C_1}{j} + \frac{C_2}{j^2} \le \mathrm {exp}\big(-\frac{C_1}{j}\big) \mathrm {exp}\big(\frac{C_2}{j^2}\big)$$. Thus, it follows that

\begin{aligned} \prod _{j= 1}^k \left(1-\frac{C_1}{j} + \frac{C_2}{j^2}\right)^{p}&\le \mathrm {exp}\left(-C_1 p \sum _{j=1}^k{\frac{1}{j}}\right) \mathrm {exp}\left( C_2 p \sum _{j=1}^k{\frac{1}{j^2}}\right) \\&\le \mathrm {exp}\left(-C_1 p \ln {(k+1)} \right) \mathrm {exp}\left(\frac{C_2 p\pi ^2}{6}\right), \end{aligned}

from which the first claim follows directly. For the second claim, we similarly have

\begin{aligned}&\sum _{j=1}^k \frac{1}{j^{1+r}} \prod _{i = j+1}^k \left(1-\frac{C_1}{i} + \frac{C_2}{i^2}\right)^{p} \le \mathrm {exp}\left(\frac{C_2 p \pi ^2}{6}\right)\sum _{j=1}^k \frac{1}{j^{1+r}} \mathrm {exp}\left(-C_1p \sum _{i=j+1}^k{\frac{1}{i}} \right) , \end{aligned}

where the latter sum can be bounded by

\begin{aligned} \sum _{j=1}^k \frac{1}{j^{1+r}} \mathrm {exp}\left(-C_1p \sum _{i=j+1}^k{\frac{1}{i}} \right)&\le \sum _{j=1}^k \frac{1}{j^{1+r}} \mathrm {exp}\left(-C_1p\ln \left( \frac{k+1}{j+1} \right) \right) \\&\le \sum _{j=1}^k{ \frac{1}{j^{1+r}} \left(\frac{k+1}{j+1}\right)^{-C_1p}} \\&= (k+1)^{-C_1p} \sum _{j=1}^k{ j^{C_1 p-r-1} \cdot \left(\frac{j+1}{j}\right)^{C_1p}} \\&\le \frac{2^{C_1p}}{C_1 p-r} (k+1)^{-r}. \end{aligned}

The final inequality is where we needed $$C_1p > r$$, in order to have something better than $$j^{-1}$$ in the sum. $$\square$$
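The two bounds of Lemma 10 can also be checked numerically by direct evaluation; the parameter choices below are arbitrary but satisfy $$C_1 p > r$$ and $$4C_2 \ge C_1^2$$:

```python
import math

def lhs_i(k, C1, C2, p):
    """Left-hand side of (i): the product over j = 1, ..., k."""
    prod = 1.0
    for j in range(1, k + 1):
        prod *= (1.0 - C1 / j + C2 / j ** 2) ** p
    return prod

def lhs_ii(k, C1, C2, p, r):
    """Left-hand side of (ii): the weighted sum of tail products."""
    total = 0.0
    for j in range(1, k + 1):
        prod = 1.0
        for i in range(j + 1, k + 1):
            prod *= (1.0 - C1 / i + C2 / i ** 2) ** p
        total += prod / j ** (1 + r)
    return total

C1, C2, p, r = 1.0, 0.5, 2.0, 1.0        # satisfies C1*p > r and 4*C2 >= C1^2
K = math.exp(C2 * p * math.pi ** 2 / 6.0)  # the common exponential constant

for k in [1, 10, 100]:
    assert lhs_i(k, C1, C2, p) <= K * (k + 1) ** (-C1 * p) + 1e-12
    assert lhs_ii(k, C1, C2, p, r) <= 2 ** (C1 * p) * K / (C1 * p - r) * (k + 1) ** (-r) + 1e-12
print("Lemma 10 bounds verified for sample parameters")
```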

## Proof of main theorem

Using the lemmas presented in the previous section, we are now in a position to prove Theorem 1. Compared to earlier results in the literature, we provide a more general result with respect to the Lipschitz condition. More precisely, with the help of the a priori bound from Lemma 2, we can replace the global Lipschitz condition with a local one.

### Proof of Theorem 1

Given the sequence of mutually independent random variables $$\xi ^k$$, we abbreviate the random functions $$f_k = f(\cdot , \xi ^k)$$ and $$T_{k} = T_{\alpha _k f, \xi^k}$$, $$k \in \mathbb {N}$$. Then the scheme can be written as $$w^{k+1} = T_k w^{k}$$. If $$T_k w^* = w^*$$ held, we would essentially only need to invoke Lemmas 8 and 9 to finish the proof. Due to the stochasticity, however, this is not the case, so we need to be more careful.

We begin by adding and subtracting the term $$T_k w^*$$ and find that

\begin{aligned} \Vert w^{k+1} - w^* \Vert ^2&= \Vert T_k w^{k} - T_k w^* \Vert ^2 + 2( T_k w^{k} - T_k w^* , T_k w^* - w^* )_{} \\&\quad + \Vert T_k w^* - w^* \Vert ^2. \end{aligned}

By Lemmas 8 and 9 the expectation $$\mathbf {E}_{\xi ^k}$$ of the first term on the right-hand side is bounded by $$(1 - 2\mu \alpha _k + 3\nu ^2\alpha _k^2)^{\nicefrac {1}{2}}\Vert w^{k} - w^* \Vert ^2$$ while by Lemma 6 the last term is bounded in expectation by $$\alpha _k^2 \sigma ^2$$. The second term is the problematic one. We add and subtract both $$w^{k}$$ and $$w^*$$ in order to find terms that we can control:

\begin{aligned}&( T_k w^{k} - T_k w^* , T_k w^* - w^* )_{} \\&\quad = \left((T_k - I) w^{k} - (T_k - I)w^*,(T_k - I)w^*\right)_{} + \left( w^{k} - w^*,(T_k - I)w^*\right)_{} \\&\quad =: I_1 + I_2. \end{aligned}

In order to bound $$I_1$$ and $$I_2$$, we first need the a priori bound from Lemma 2, which will also enable us to utilize the local Lipschitz condition. By Lemma 6, we find that

\begin{aligned} \left(\mathbf {E}_{\xi ^k} \left[\Vert T_kw^*\Vert ^j\right] \right)^{\frac{1}{j}} \le \Vert w^*\Vert + \left(\mathbf {E}_{\xi ^k} \left[ \Vert \nabla f_k(w^*)\Vert _{H^*}^j\right] \right)^{\frac{1}{j}} \le \Vert w^*\Vert + \sigma \end{aligned}

is bounded for $$j \le 2^m$$. As $$T_k$$ is a contraction, we also obtain

\begin{aligned} \left(\mathbf {E}_k \left[\Vert T_kw^{k}\Vert ^j\right] \right)^{\frac{1}{j}}&\le \left(\mathbf {E}_k \left[\Vert T_kw^{k} - T_kw^*\Vert ^j\right] \right)^{\frac{1}{j}} + \left(\mathbf {E}_{\xi ^k} \left[\Vert T_kw^*\Vert ^j\right] \right)^{\frac{1}{j}}\\&\le \left(\mathbf {E}_{k} \left[\Vert w^{k} - w^*\Vert ^j\right] \right)^{\frac{1}{j}} + \Vert w^*\Vert + \sigma . \end{aligned}

Thus, there exists a random variable $$R_1$$ such that

\begin{aligned} \max \left( \Vert T_k w^{k}\Vert , \Vert T_k w^*\Vert \right) \le R_1, \end{aligned}

and $$\mathbf {E}_k [R_1^j]$$ is bounded for $$j \le 2^m$$. For $$I_1$$, we then obtain that

\begin{aligned} I_1&\le \left((T_k - I) w^{k} - (T_k - I) w^* ,(T_k - I)w^*\right)_{} \\&\le \Vert \alpha _k \nabla f_k(T_kw^{k}) - \alpha _k\nabla f_k(T_kw^*)\Vert _{H^*} \Vert \alpha _k \nabla f_k(w^*) \Vert _{H^*}\\&\le \alpha _k^2 L_{\xi ^k}(R_1) \Vert T_kw^{k} - T_kw^*\Vert \Vert \nabla f_k(w^*) \Vert _{H^*}\\&\le \alpha _k^2 L_{\xi ^k}(R_1) \Vert w^{k} - w^*\Vert \Vert \nabla f_k(w^*) \Vert _{H^*}, \end{aligned}

where we used the fact that $$T_k$$ is contractive in the last step. Taking the expectation, we then have by Hölder’s inequality that

\begin{aligned} \mathbf {E}_k [I_1 ]&\le \alpha _k^2 \mathbf {E}_k \left[L_{\xi ^k}(R_1) \Vert w^{k} - w^*\Vert \Vert \nabla f_k(w^*) \Vert _{H^*}\right]\\&\le \alpha _k^2 \tilde{L}_1 \left(\mathbf {E}_{k-1} \left[\Vert w^{k} - w^*\Vert ^{2^m} \right]\right)^{2^{-m}} \left(\mathbf {E}_{\xi^k} \left[\Vert \nabla f_k(w^*) \Vert _{H^*}^{2^m} \right]\right)^{2^{-m}}, \end{aligned}

where

\begin{aligned} \tilde{L}_1 = {\left\{ \begin{array}{ll} \left(\mathbf {E}_k \left[P(R_1)^{{\frac{2^{m}}{2^{m} -2}}} \right]\right)^{{\frac{2^{m} -2}{2^{m}}}}, \quad &{}m > 1,\\ \sup |P(R_1) |, &{}m = 1. \end{array}\right. } \end{aligned}

As P is a polynomial of order at most $$2^m -2$$, the expression only contains terms $$R_1^j$$ where the exponent j is at most $$\left({\frac{2^{m}}{2^{m} -2}}\right) \left(2^m -2\right) = 2^{m}$$. Hence $$\tilde{L}_1$$ is bounded, and in view of Lemma 2 we get that

\begin{aligned} \mathbf {E}_k[I_1] \le D_1 \alpha _k^2, \end{aligned}

where $$D_1 \ge 0$$ is a constant depending only on $$\Vert w^*\Vert$$, $$\Vert w_1 - w^*\Vert$$, $$\sigma$$ and $$\eta$$. For $$I_2$$, we add and subtract $$\alpha _k \iota \nabla f_k (w^*)$$ to get

\begin{aligned} I_2&= \left( w^{k} - w^*,(T_k - I)w^*\right)_{} \\&= \left( w^{k} - w^*,(T_k - I)w^* + \alpha _k \iota \nabla f_k (w^*) \right)_{} - \left( w^{k} - w^*,\alpha _k \iota \nabla f_k (w^*)\right)_{} . \end{aligned}

Since $$w^{k} - w^*$$ is independent of $$\alpha _k \nabla f_k (w^*)$$, it follows that

\begin{aligned} \mathbf {E}_{\xi^k} [\left( w^{k} - w^*,\alpha _k \iota \nabla f_k (w^*)\right)_{} ] = \left( w^{k} - w^*, \mathbf {E}_{\xi^k} [\alpha _k \iota \nabla f_k (w^*)]\right)_{} = 0. \end{aligned}

Using the Cauchy–Schwarz inequality and Lemma 6, we find that

\begin{aligned} \mathbf {E}_k [I_2]&\le \mathbf {E}_k \left[ \Vert w^{k} - w^*\Vert \Vert \iota ^{-1} (T_k - I)w^* + \alpha _k \nabla f_k (w^*)\Vert _{H^*} \right]\\&\le \mathbf {E}_k\left[ L_{\xi ^k}(R_2) \alpha _k^2 \Vert w^{k} - w^*\Vert \Vert \nabla f_k (w^*)\Vert _{H^*} \right] \\&\le \alpha _k^2 \tilde{L}_2 \left(\mathbf {E}_{k-1}\left[\Vert w^{k} - w^*\Vert ^{2^m}\right] \right)^{2^{-m}} \left(\mathbf {E}_{\xi^k} \left[\Vert \nabla f_k(w^*) \Vert _{H^*}^{2^m} \right]\right)^{2^{-m}}, \end{aligned}

where $$R_2 = \max (\Vert w^*\Vert ,\Vert \nabla f_k (w^*)\Vert _{H^*})$$ and

\begin{aligned} \tilde{L}_2 = {\left\{ \begin{array}{ll} \left(\mathbf {E}_k \left[P(R_2)^{{\frac{2^{m}}{2^{m} -2}}} \right]\right)^{{\frac{2^{m} -2}{2^{m}}}}, \quad &{}m > 1,\\ \sup |P(R_2) |, &{}m = 1. \end{array}\right. } \end{aligned}

Just as for $$I_1$$, we therefore get by Lemma 2 that

\begin{aligned} \mathbf {E}_k[I_2] \le D_2 \alpha _k^2, \end{aligned}

where $$D_2 \ge 0$$ is a constant depending only on $$\Vert w^*\Vert$$, $$\Vert w_1 - w^*\Vert$$, $$\sigma$$ and $$\eta$$.

Summarising, we now have

\begin{aligned} \mathbf {E}_k\left[ \Vert w^{k+1} - w^* \Vert ^2 \right]&\le \tilde{C}_k \mathbf {E}_{k-1}\left[\Vert w^{k} - w^* \Vert ^2\right] + \alpha _k^2 D \end{aligned}

with $$\tilde{C}_k = \left(1 - 2\mu \alpha _k + 3\nu ^2\alpha _k^2 \right)^{\nicefrac {1}{2}}$$ and $$D = \sigma ^2 + D_1 + D_2$$. Recursively applying the above bound yields

\begin{aligned} \mathbf {E}_k\left[ \Vert w^{k+1} - w^* \Vert ^2 \right] \le \prod _{j=1}^k{ \tilde{C}_j \Vert w_1 - w^* \Vert ^2} + D \sum _{j=1}^k{ \alpha _j^2 \prod _{i=j+1}^k{ \tilde{C}_i} }. \end{aligned}

Applying Lemma 10 (i) and (ii) with $$p=\nicefrac {1}{2}$$, $$r=1$$, $$C_1 = 2\mu \eta$$ and $$C_2 = 3\nu ^2\eta ^2$$ then shows that

\begin{aligned} \prod _{j=1}^k{ \tilde{C}_j} \le \mathrm {exp}\left(\frac{\nu ^2\eta ^2 \pi ^2}{4}\right) (k+1)^{-\mu \eta } \end{aligned}

and

\begin{aligned} \sum _{j=1}^k{ \alpha _j^2 \prod _{i=j+1}^k{ \tilde{C}_i}} \le \eta ^2 2^{\mu \eta } \mathrm {exp}\left(\frac{\nu ^2\eta ^2 \pi ^2}{4}\right) \frac{1}{ \mu \eta - 1} (k+1)^{-1}. \end{aligned}

Thus, we finally arrive at

\begin{aligned} \mathbf {E}_k\left[ \Vert w^{k+1} - w^*\Vert ^2 \right] \le \frac{C}{k+1}, \end{aligned}

where C depends on $$\Vert w^*\Vert$$, $$\Vert w_1 - w^*\Vert$$, $$\mu$$, $$\sigma$$ and $$\eta$$. $$\square$$

### Remark 4

The above proof is complicated mainly by the stochasticity and by the lack of strong convexity of the individual functions $$f(\cdot , \xi )$$. We briefly consider the simpler deterministic, full-batch case with

\begin{aligned} w^{k+1} = w^k - \alpha _k \nabla F(w^{k+1}), \end{aligned}

where F is strongly convex with convexity constant $$\mu$$. Then it can easily be shown that

\begin{aligned} ( \nabla F(v) - \nabla F(w) , v - w )_{} \ge \mu \Vert v - w\Vert ^2. \end{aligned}

This means that

\begin{aligned} \Vert \left( I + \alpha \nabla F \right)^{-1}(v) - \left( I + \alpha \nabla F \right)^{-1}(w)\Vert \le (1 + \alpha \mu )^{-1} \Vert v - w\Vert , \end{aligned}

i.e. the resolvent is a strict contraction. Since $$\nabla F(w^*) = 0$$, it follows that $$\left( I + \alpha \nabla F \right)^{-1} w^* = w^*$$ so a simple iterative argument shows that

\begin{aligned} \Vert w^{k+1} - w^*\Vert ^2 \le \prod _{j=1}^k \left(1 + \alpha _j \mu \right)^{-1} \Vert w_1 - w^*\Vert ^2. \end{aligned}

Using $$(1 + \alpha \mu )^{-1} \le 1 - \mu \alpha + \mu ^2\alpha ^2$$, choosing $$\alpha _k = \eta /k$$ and applying Lemma 10 then shows that

\begin{aligned} \Vert w^{k+1} - w^*\Vert ^2 \le C (k+1)^{-1} \end{aligned}

for appropriately chosen $$\eta$$. In particular, these arguments do not require the Lipschitz continuity of $$\nabla F$$, which is needed in the stochastic case to handle the terms arising due to $$\nabla f(w^*, \xi ) \ne 0$$.
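To illustrate the remark, the deterministic iteration can be tested on a toy quadratic, for which the resolvent is available in closed form. The following sketch (with our own parameter choices, not taken from the experiments below) confirms the contraction bound numerically:

```python
import numpy as np

# Toy illustration of the deterministic proximal point iteration from the
# remark (our own example). For the strongly convex quadratic
# F(w) = (mu/2)||w||^2 the resolvent (I + alpha grad F)^{-1} is simply
# division by (1 + alpha*mu), and the minimizer is w* = 0.
mu, eta = 1.0, 2.0
w = np.array([1.0, -0.5])          # starting point w_1, so ||w_1||^2 = 1.25
for k in range(1, 101):
    alpha = eta / k                # step sizes alpha_k = eta / k
    w = w / (1.0 + alpha * mu)     # resolvent step
err_final = float(np.linalg.norm(w) ** 2)

# The bound ||w^{k+1} - w*||^2 <= prod_j (1 + alpha_j mu)^{-1} ||w_1 - w*||^2
bound = 1.25 * np.prod([1.0 / (1.0 + (eta / j) * mu) for j in range(1, 101)])
```

For this quadratic the contraction factor per step is exactly $$(1+\alpha _k\mu )^{-1}$$, so the computed error lies well below the bound above.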

## Numerical experiments

In order to illustrate our results, we set up a numerical experiment along the lines given in the introduction. In the following, let $$H = L^2(0,1)$$ be the Lebesgue space of square-integrable functions equipped with the usual inner product and norm. Further, let $$x_j^i \in H$$ be elements from two different classes within the space H, with $$i = 1$$, $$j = 1, \dots , \left\lfloor \frac{n}{2}\right\rfloor$$ for the first class and $$i = 2$$, $$j = \left\lfloor \frac{n}{2}\right\rfloor + 1, \ldots , n$$ for the second. In particular, we choose each $$x_j^1$$ to be a polynomial of degree 4 and each $$x_j^2$$ to be a trigonometric function with bounded frequency. The polynomial coefficients and the frequencies were chosen randomly.

We want to classify these functions as either polynomial or trigonometric. To do this, we set up an affine (SVM-like) classifier by choosing the loss function $$\ell (h,y)= \ln (1 + \mathrm {e}^{-hy})$$ and the prediction function $$h( [w,\overline{w}], x) = ( w , x )_{} + \overline{w}$$ with $$[w,\overline{w}] \in L^2(0,1) \times \mathbb {R}$$. Without $$\overline{w}$$, this would be linear, but by including $$\overline{w}$$ we can allow for a constant bias term and thereby make it affine. We also add a regularization term $$\frac{\lambda }{2} \Vert w\Vert ^2$$ (not including the bias), such that the minimization objective is

\begin{aligned} F([w,\overline{w}]) = \frac{1}{n} \sum _{j = 1}^{n}{\ell (h( [w,\overline{w}], x_j), y_j)} + \frac{\lambda }{2} \Vert w\Vert ^2 , \end{aligned}

where $$[x_j, y_j] = [x^1_j, -1]$$ if $$j \le \left\lfloor \frac{n}{2} \right\rfloor$$ and $$[x_j, y_j] = [x^2_j, 1]$$ if $$j > \left\lfloor \frac{n}{2}\right\rfloor$$, similar to Eq. (2). In one step of SPI, we use the function

\begin{aligned} f([w,\overline{w}], \xi ) = \ell (h( [w,\overline{w}], x_{\xi }), y_{\xi }) + \frac{\lambda }{2} \Vert w\Vert ^2 , \end{aligned}

with a random variable $$\xi :\Omega \rightarrow \{1,\dots ,n\}$$. Since we cannot do computations directly in the infinite-dimensional space, we discretize all the functions using N equidistant points in [0, 1], omitting the endpoints. For each N, this gives us an optimization problem on $$\mathbb {R}^N$$, which approximates the problem on H.
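A sketch of this setup might look as follows; the precise coefficient and frequency distributions are not specified above, so the uniform choices below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n=1000, N=200):
    """Sketch of the data generation (our reading of the setup; the exact
    coefficient and frequency distributions are our own choices).
    Functions are sampled at N equidistant interior points of [0, 1]."""
    t = np.linspace(0.0, 1.0, N + 2)[1:-1]          # omit the endpoints
    X, y = [], []
    for j in range(n):
        if j < n // 2:                               # class 1: degree-4 polynomials
            coeffs = rng.uniform(-1.0, 1.0, size=5)  # 5 coefficients -> degree 4
            X.append(np.polyval(coeffs, t)); y.append(-1.0)
        else:                                        # class 2: bounded-frequency trig
            freq = rng.uniform(1.0, 10.0)
            phase = rng.uniform(0.0, 2.0 * np.pi)
            X.append(np.sin(2.0 * np.pi * freq * t + phase)); y.append(1.0)
    return t, np.array(X), np.array(y)

t, X, y = make_dataset()
```

Each row of `X` is then one discretized sample $$x_j$$, and `y` holds the corresponding labels $$y_j \in \{-1, 1\}$$.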

For the implementation, we make use of the following computational idea, which makes SPI essentially as fast as SGD. Differentiating the chosen $$\ell$$ and h shows that the scheme is given by the iteration

\begin{aligned} [w,\overline{w}]^{k+1} = [w,\overline{w}]^{k} + c_k [x_k,1] - \lambda \alpha _k [w,0]^{k+1}, \end{aligned}

where $$c_k = \frac{\alpha _k y_k }{ 1 + \mathrm {exp}( ( w^{k+1}, \,x_k ) y_k + \overline{w}^{k+1} y_k)}$$. This is equivalent to

\begin{aligned} w^{k+1} = \frac{1}{1 + \alpha _k \lambda } \left(w^{k} + c_k x_k\right) \quad \text {and} \quad \overline{w}^{k+1} = \overline{w}^{k} + c_k. \end{aligned}

Inserting the expression for $$[w,\overline{w}]^{k+1}$$ in the definition of $$c_k$$, we obtain that

\begin{aligned} c_k = \frac{\alpha _k y_k}{ 1 + \mathrm {exp}\left( \frac{1}{1 + \alpha _k \lambda } ( w^{k} + c_k x_k , x_k )_{} y_k + (\overline{w}^{k} + c_k) y_k\right) }. \end{aligned}

We thus only need to solve one scalar-valued equation. This is at most twice as expensive as SGD, since the equation solving is essentially free and the only additional costly term is $$( x_k , x_k )_{}$$ (the term $$( w^k , x_k )_{}$$ of course has to be computed also in SGD). By storing the scalar result, the extra cost becomes essentially zero if the same sample is revisited. We note that extending this approach to larger batch sizes is straightforward: if the batch size is B, then one has to solve a B-dimensional equation instead.
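The computational idea can be sketched as follows; the function and variable names are ours. Since $$0< (1 + \mathrm{e}^{z})^{-1} < 1$$, the solution $$c_k$$ must lie between 0 and $$\alpha _k y_k$$, which yields a bracket for a simple bisection:

```python
import numpy as np

def spi_step(w, wbar, x, y, alpha, lam, tol=1e-12):
    """One SPI step for the logistic loss + L2 regularizer, reduced to the
    scalar equation for c_k from the text (a sketch of the idea, not the
    authors' implementation). Solved by bisection on [min(0, a*y), max(0, a*y)]."""
    wx, xx = np.dot(w, x), np.dot(x, x)    # (w^k, x_k) and the reusable (x_k, x_k)
    s = 1.0 / (1.0 + alpha * lam)

    def g(c):  # c minus the right-hand side of the fixed-point equation
        z = s * (wx + c * xx) * y + (wbar + c) * y
        return c - alpha * y / (1.0 + np.exp(z))

    lo, hi = min(0.0, alpha * y), max(0.0, alpha * y)   # g(lo) <= 0 <= g(hi)
    for _ in range(200):
        c = 0.5 * (lo + hi)
        if g(c) > 0.0:
            hi = c
        else:
            lo = c
        if hi - lo < tol:
            break
    c = 0.5 * (lo + hi)
    return s * (w + c * x), wbar + c       # the closed-form update from the text
```

Replacing the bisection by a few Newton iterations, or a library root finder, would work equally well; the cost is negligible either way.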

Using this idea, we implemented the method in Python and tested it on a series of different discretizations. We took $$n = 1000$$, i.e. 500 functions of each type, $$M = 10{,}000$$ time steps and discretization parameters $$N = 100 \cdot 2^i$$ for $$i = 1, \ldots , 11$$ to approximate the infinite-dimensional space $$L^2(0,1)$$. We used $$\lambda = 10^{-3}$$ and the initial step size $$\eta = \nicefrac{2}{\lambda }$$, since in this case it can be shown that $$\mu \ge \lambda$$. There is no closed-form expression for the exact minimum $$w^*$$, so we instead ran SPI with $$10M = 100{,}000$$ time steps and used the resulting reference solution as an approximation to $$w^*$$. Further, we approximated the expectation $$\mathbf {E}_k$$ by running the experiment 100 times and averaging the resulting errors. In order to compensate for the vectors becoming longer as N increases, we measured the errors in the RMS-norm $$\Vert \cdot \Vert _N = \Vert \cdot \Vert _{\mathbb {R}^N} / \sqrt{N+1}$$, which tends to the $$L^2$$ norm as $$N \rightarrow \infty$$.

Figure 1 shows the resulting approximated errors $$\mathbf {E}_{k-1}[\Vert w^{k} - w^*\Vert _N^2]$$. As expected, we observe convergence proportional to $$\nicefrac {1}{k}$$ for all N. The error constants vary to a certain extent, but they are reasonably similar, and they vary less as the problem approaches the infinite-dimensional case. In order to decrease the computational requirements, we only compute statistics every 100 time steps, which is why the plot starts at $$k = 100$$.

In contrast, redoing the same experiment with the explicit SGD method instead results in Fig. 2. We note that except for $$N = 200$$ and $$N=400$$, the method seemingly does not converge at all. This is partially explained by the fact that the Lipschitz constant grows with N (at least for the coarsest discretizations, for which we could estimate it), so that we get closer to the stability boundary. The main reason, however, is rare “bad” paths, in which the method initially takes a large step in the wrong direction. In theory, the method eventually recovers from this; in practice, it does not, due to the finite computational budget. Even when such bad paths are omitted from the results and $$\mathcal {O}(1/k)$$ convergence is observed, the errors are much larger than in Fig. 1, and many more steps would be necessary to reach the same accuracy as SPI. Since our implementations are certainly not optimal, we do not show a comparison of computational times here; they are, however, very similar, meaning that SPI is more efficient than SGD for this problem.
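For comparison, the corresponding explicit SGD update for the same loss requires no equation solving, since the gradient is evaluated at the current iterate (again a sketch with our own naming):

```python
import numpy as np

def sgd_step(w, wbar, x, y, alpha, lam):
    """Explicit SGD update for the logistic loss + L2 regularizer
    (a sketch for comparison). Differentiating f gives
    grad_w f = -y*x / (1 + exp(h*y)) + lam*w with h = (w, x) + wbar,
    so the scalar c is given explicitly rather than by an equation."""
    c = alpha * y / (1.0 + np.exp((np.dot(w, x) + wbar) * y))
    return (1.0 - alpha * lam) * w + c * x, wbar + c
```

The only structural difference to the SPI step is that $$c_k$$ appears explicitly here, whereas SPI evaluates it at the new iterate; this is precisely what removes the step size restriction.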

## Conclusions

We have rigorously proved convergence with an optimal rate for the stochastic proximal iteration method in a general Hilbert space. This improves on the existing analyses in two ways. First, it extends similar results from the finite-dimensional setting to the infinite-dimensional case, as well as to more general operators. Second, it improves on similar infinite-dimensional results that only establish convergence, without any error bounds. The latter improvement comes at the cost of stronger assumptions on the cost functional. Global Lipschitz continuity of the gradient is, admittedly, a rather strong assumption. However, as we have demonstrated, it can be replaced by the weaker condition of local Lipschitz continuity, where the maximal growth of the Lipschitz constant is determined by higher moments of the gradient at the minimum. Finally, the numerical results in the previous section demonstrate that the theoretical results are also applicable in practice.


## Code availability

The code used for the numerical experiments is available on request from the authors.


## Acknowledgements

The authors would like to thank the anonymous referee and Eskil Hansen for valuable feedback.

## Funding

Open access funding provided by Lund University. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

## Author information

Authors

### Contributions

All authors contributed to all parts of the article.

### Corresponding author

Correspondence to Monika Eisenmann.

## Ethics declarations

### Computations

The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at LUNARC partially funded by the Swedish Research Council through grant agreement no. 2018–05973.

### Conflict of interest/Competing interests

The authors have no conflicts of interest to declare that are relevant to the content of this article.
