Stochastic proximal gradient methods for nonconvex problems in Hilbert spaces

For finite-dimensional problems, stochastic approximation methods have long been used to solve stochastic optimization problems. Their application to infinite-dimensional problems is less understood, particularly for nonconvex objectives. This paper presents convergence results for the stochastic proximal gradient method in Hilbert spaces, motivated by optimization problems with partial differential equation (PDE) constraints with random inputs and coefficients. We study stochastic algorithms for nonconvex and nonsmooth problems, where the nonsmooth part is convex and the nonconvex part is the expectation, which is assumed to have a Lipschitz continuous gradient. The optimization variable is an element of a Hilbert space. We show almost sure convergence of strong limit points of the random sequence generated by the algorithm to stationary points. We demonstrate the stochastic proximal gradient algorithm on a tracking-type functional with an $L^1$-penalty term constrained by a semilinear PDE and box constraints, where input terms and coefficients are subject to uncertainty. We verify conditions for ensuring convergence of the algorithm and show a simulation.


Introduction
In this paper, we focus on stochastic approximation methods for solving a stochastic optimization problem on a Hilbert space H of the form

(P)  min_{u∈H} { f(u) = j(u) + h(u) },

where the expectation j(u) = 𝔼[J(u, ξ)] is generally nonconvex with a Lipschitz continuous gradient and h is a proper, lower semicontinuous, and convex function that is generally nonsmooth.
Our work is motivated by applications to PDE-constrained optimization under uncertainty, where a nonlinear PDE constraint can lead to an objective function that is nonconvex with respect to the Hilbert-valued variable. To handle the (potentially infinite-dimensional) expectation, algorithmic approaches for solving such problems involve either some discretization of the stochastic space or an ensemble-based approach with sampling or carefully chosen quadrature points. Stochastic discretization includes polynomial chaos and the stochastic Galerkin method; cf. [24,30,34,47]. For ensemble-based methods, the simplest method is sample average approximation (SAA), where the original problem is replaced by a proxy problem with a fixed set of samples, which can then be solved using a deterministic solver. A number of standard improvements to Monte Carlo sampling have been applied to optimal control problems in, e.g., [1,54]. Another ensemble-based approach is the stochastic collocation method, which has been used in optimal control problems in e.g. [47,51]. Sparse-tensor discretization has been used for optimal control problems in, for instance, [28,29].
The approach we use is an ensemble-based approach called stochastic approximation, which is fundamentally different in the sense that sampling takes place dynamically as part of the optimization procedure, leading to an algorithm with low complexity and computational effort when compared to other approaches. Stochastic approximation originated in a groundbreaking paper by [45], where an iterative method to find the root of an unknown function using noisy estimates was proposed. The authors of [25] used this idea to solve a regression problem using finite differences subject to noise. Algorithms of this kind, with bias in addition to stochastic noise, are sometimes called stochastic quasi-gradient methods; see, e.g., [17,53]. Basic versions of these algorithms rely on positive step sizes t_n satisfying ∑_{n=1}^∞ t_n = ∞ and ∑_{n=1}^∞ t_n² < ∞. The (almost sure) asymptotic convergence of stochastic approximation algorithms for convex problems is classical in finite dimensions; we refer to the texts by [16,33].
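As a toy illustration of these step-size conditions (not from the paper; the quadratic objective, noise model, and all constants are invented for the sketch), the Robbins–Monro rule t_n = 1/n averages out zero-mean gradient noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_estimate(u, xi):
    # Unbiased estimate of the gradient of j(u) = 0.5*||u||^2 (true gradient: u).
    return u + xi

u = np.array([5.0, -3.0])
for n in range(1, 5001):
    t_n = 1.0 / n                 # sum of t_n diverges, sum of t_n^2 converges
    xi = rng.normal(size=2)       # zero-mean noise
    u = u - t_n * grad_estimate(u, xi)
# The iterates approach the minimizer u* = 0 almost surely.
```

With a constant step size instead, the iterates would only reach a noise-dominated neighborhood of the minimizer.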
There have been a number of contributions with proofs of convergence of the stochastic gradient method for unconstrained nonconvex problems; see [6,7,49,56]. Fewer results exist for constrained and/or nonsmooth nonconvex problems. A randomized stochastic algorithm was proposed by Ghadimi et al. [21]; this scheme involves running a stochastic approximation process and randomly choosing an iterate from the generated sequence. There have been some contributions involving constant step sizes with increasing sampling; see [35,44]. Convergence of projection-type methods for nonconvex problems was shown in [32] and for prox-type methods by Davis et al. [13].
As far as stochastic approximation on function spaces is concerned, many contributions were motivated by applications with nonparametric statistics. Perhaps the oldest example is from [55]. Goldstein [22] studied an infinite-dimensional version of the Kiefer-Wolfowitz procedure. A significant contribution for unconstrained problems was by Yin and Zhu [58]. Projection-type methods were studied by [3,10,12,40].
In this paper, we prove convergence results for nonconvex and nonsmooth problems in Hilbert spaces. We present convergence analysis that is based on the recent contributions in [13,35]. Applications of the stochastic gradient method to PDE-constrained optimization have already been explored by [19,37]. In these works, however, convexity of the objective function is assumed, leaving the question of convergence in the more general case entirely open. We close that gap by making the following contributions:
- For an objective function that is the sum of a smooth, generally nonconvex expectation and a convex, nonsmooth term, we prove that strong accumulation points of iterates generated by the method are stationary points.
- We show that convergence holds even in the presence of systematic additive bias, which is relevant for the application we have in mind.
- We demonstrate the method on an application to PDE-constrained optimization under uncertainty and verify conditions for convergence.
The paper is organized as follows. In Sect. 2, notation and background is given. Convergence of two related algorithms is proven in Sect. 3. In Sect. 4, we introduce a problem in PDE-constrained optimization under uncertainty, where coefficients in the semilinear PDE constraint are subject to uncertainty. The problem is shown to satisfy conditions for convergence, and numerical experiments demonstrate the method. We finish the paper with closing remarks in Sect. 5.

Notation and background
We recall some notation and background from convex analysis and stochastic processes; see [4,11,38,43].
Let H be a Hilbert space with the scalar product ⟨⋅, ⋅⟩ and norm ‖⋅‖. The symbols → and ⇀ denote strong and weak convergence, respectively. The set of proper, convex, and lower semicontinuous functions h : H → (−∞, ∞] is denoted by Γ₀(H). Given a function h ∈ Γ₀(H) and t > 0, the proximity operator prox_{th} : H → H is given by

prox_{th}(u) := argmin_{v∈H} { h(v) + (1/2t)‖v − u‖² }.

We recall that for a proper function h : H → (−∞, ∞], the subdifferential (in the sense of convex analysis) is the set-valued operator

∂h : H ⇉ H, ∂h(u) := {v ∈ H : h(w) ≥ h(u) + ⟨v, w − u⟩ for all w ∈ H}.

For any h ∈ Γ₀(H), the subdifferential ∂h is maximally monotone. The domain of h is denoted by dom(h). The indicator function of a set C is denoted by δ_C, where δ_C(u) = 0 if u ∈ C and δ_C(u) = ∞ otherwise. The sum of two sets A and B with λ ∈ ℝ is given by A + λB := {a + λb : a ∈ A, b ∈ B}. The distance of a point u to a nonempty, closed set A is denoted by d(u, A) := inf_{a∈A} ‖u − a‖ and the diameter of A is denoted by diam(A) := sup_{u,v∈A} ‖u − v‖. For a nonempty and convex set C, the normal cone N_C(u) at u ∈ C is defined by

N_C(u) := {v ∈ H : ⟨v, w − u⟩ ≤ 0 for all w ∈ C}.

If h is proper and u ∈ dom(h), then ∂h(u) is closed and convex. We recall that the graph of ∂h for a function h ∈ Γ₀(H), given by the set gra(∂h) = {(u, v) ∈ H × H : v ∈ ∂h(u)}, is sequentially closed in the strong-to-weak topology, meaning that for u_n → u, v_n ∈ ∂h(u_n), and v_n ⇀ v, it follows that v ∈ ∂h(u). The normal cone mapping u ↦ N_C(u) is likewise strong-to-weak sequentially closed if C is convex.
Throughout, (Ω, F, ℙ) will denote a probability space, where Ω represents the sample space, F ⊂ 2^Ω is the σ-algebra of events on the power set of Ω, denoted by 2^Ω, and ℙ : F → [0, 1] is a probability measure. Given a random vector ξ : Ω → Ξ ⊂ ℝ^m, we write ξ ∈ Ξ to denote a realization of the random vector. The operator 𝔼[⋅] denotes the expectation with respect to this distribution; for a parametrized functional J : H × Ξ → ℝ, this is defined as the integral over all elements in Ω, i.e.,

𝔼[J(u, ξ)] = ∫_Ω J(u, ξ(ω)) dℙ(ω).

A filtration is a sequence {F_n} of sub-σ-algebras of F such that F₁ ⊂ F₂ ⊂ ⋯ ⊂ F. We define a discrete H-valued stochastic process as a collection of H-valued random variables indexed by n, in other words, the set {β_n : Ω → H | n ∈ ℕ}. The stochastic process is said to be adapted to a filtration {F_n} if and only if β_n is F_n-measurable for all n. The natural filtration is the filtration generated by the sequence {ξ_n} and is given by F_n = σ({ξ₁, …, ξ_n}); here, the σ-algebra generated by a random variable ξ : Ω → ℝ is σ(ξ) = {ξ⁻¹(B) : B ∈ B}, where B is the Borel σ-algebra on ℝ, and analogously, the σ-algebra generated by the set of random variables {ξ₁, …, ξ_n} is the smallest σ-algebra such that ξ_i is measurable for all i = 1, …, n. If for an event F ∈ F it holds that ℙ(F) = 1, or equivalently, ℙ(Ω∖F) = 0, we say F occurs almost surely (a.s.). Sometimes we also say that such an event occurs with probability one. A sequence of random variables {β_n} is said to converge almost surely to a random variable β if and only if ℙ({ω ∈ Ω : lim_{n→∞} β_n(ω) = β(ω)}) = 1.

For an integrable random variable β : Ω → ℝ, the conditional expectation is denoted by 𝔼[β|F_n], which is itself a random variable that is F_n-measurable and which satisfies ∫_A 𝔼[β|F_n](ω) dℙ(ω) = ∫_A β(ω) dℙ(ω) for all A ∈ F_n. Almost sure convergence of H-valued stochastic processes and conditional expectation are defined analogously. Given a random operator F : X × Ξ → Y, where X and Y are Banach spaces, we will sometimes use the notation F_ξ := F(⋅, ξ) : X → Y for a fixed (but arbitrary) ξ ∈ Ξ. For a Banach space (X, ‖⋅‖_X), the Bochner space L^p(Ω, X) is the set of all (equivalence classes of) strongly measurable functions u : Ω → X having finite norm, where the norm is defined by ‖u‖_{L^p(Ω,X)} := (∫_Ω ‖u(ω)‖_X^p dℙ(ω))^{1/p} for 1 ≤ p < ∞ and ‖u‖_{L^∞(Ω,X)} := ess sup_{ω∈Ω} ‖u(ω)‖_X. For an open subset U of a Banach space X and a function J : U → ℝ, we denote the Gâteaux derivative at u ∈ U in the direction v ∈ X by dJ(u; v). The Fréchet derivative at u is denoted by J′ : U → L(X, ℝ), where L(X, ℝ) is the set of bounded and linear operators mapping X to ℝ. We recall this is none other than the dual space X* and we denote the dual pairing by ⟨⋅, ⋅⟩_{X*,X}. For an open subset U of a Hilbert space H and a Fréchet differentiable function j : U → ℝ, the gradient ∇j : U → H is the Riesz representation of j′ : U → H*, i.e., it satisfies ⟨∇j(u), v⟩ = ⟨j′(u), v⟩_{H*,H} for all u ∈ U and v ∈ H. In Hilbert spaces, the Riesz representation relates elements of the dual space to the Hilbert space itself, allowing us to drop the dual pairing notation and use simply ⟨⋅, ⋅⟩.
The notation C^{1,1}_L(U) is used to denote the set of continuously differentiable functions on U ⊂ H with an L-Lipschitz gradient, meaning ‖∇j(u) − ∇j(v)‖ ≤ L‖u − v‖ is satisfied for all u, v ∈ U. The following lemma gives a classical Taylor estimate for such functions.
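The estimate in question is the classical descent lemma; for j ∈ C^{1,1}_L(U) with U convex, it reads:

```latex
j(v) \;\le\; j(u) + \langle \nabla j(u),\, v - u \rangle + \frac{L}{2}\,\| v - u \|^2
\qquad \text{for all } u, v \in U.
```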

Asymptotic convergence results
In this section, we show asymptotic convergence results for two variants of the stochastic proximal gradient method in Hilbert spaces for solving Problem (P). Let G : H × Ξ → H be a parametrized operator (the stochastic gradient) approximating (in a sense to be specified later) the gradient ∇j : H → H and let t_n be a positive step size. Both algorithms in this section will share the basic iterative form

u_{n+1} := prox_{t_n h}(u_n − t_n G(u_n, ξ_n)),

where h is the nonsmooth term from Problem (P). The following assumptions will be in force in all sections.

Assumption 3.1 (i) The sequence {u_n} is a.s. contained in a bounded set V ⊂ H. (ii) j ∈ C^{1,1}_L(V) and j is bounded below on V. (iii) The bias r_n := 𝔼[G(u_n, ξ_n)|F_n] − ∇j(u_n) is such that {u_n} and {r_n} are adapted to {F_n} and, for K_n := ess sup_{ω∈Ω} ‖r_n(ω)‖, the conditions ∑_{n=1}^∞ t_n K_n < ∞ and sup_n K_n < ∞ are satisfied.
(iv) For all n, e_n := G(u_n, ξ_n) − 𝔼[G(u_n, ξ_n)|F_n] is an H-valued random variable.

Remark 3.2
The assumption that the sequence {u_n} stays bounded with probability one is by no means automatically fulfilled, but it can be verified or enforced in different ways. We refer to [6, Section 5.2] and [13, Section 6.1] for conditions on the function, constraint set, and/or regularizers that ensure boundedness of iterates. The conditions in Assumption 3.1 allow for additive bias r_n in the stochastic gradient in addition to the zero-mean error e_n. The requirement that u_n and r_n are adapted to F_n is automatically fulfilled if {F_n} is chosen to be the natural filtration generated by {ξ₁, …, ξ_n}. Together, Assumption 3.1(iii) and Assumption 3.1(iv) imply 𝔼[G(u_n, ξ_n)|F_n] = ∇j(u_n) + r_n and 𝔼[e_n|F_n] = 0. Notice that a single realization ξ_n ∈ Ξ can be replaced by m_n independently drawn realizations ξ_n^1, …, ξ_n^{m_n} ∈ Ξ, since the averaged estimator (1/m_n)∑_{i=1}^{m_n} G(u_n, ξ_n^i) has the same conditional expectation. This set of m_n samples is sometimes called a "batch"; batches clearly reduce the variance of the stochastic gradient.
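A quick numerical check of this variance-reduction effect of batching (a toy estimator with invented noise; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def batch_gradient(u, xis):
    # Average of single-sample gradient estimates over a batch.
    return np.mean([u + xi for xi in xis], axis=0)

u = np.array([1.0, 2.0])
single  = np.array([batch_gradient(u, rng.normal(size=(1, 2)))  for _ in range(2000)])
batched = np.array([batch_gradient(u, rng.normal(size=(16, 2))) for _ in range(2000)])

var_single  = single.var(axis=0).mean()   # roughly the single-sample noise variance
var_batched = batched.var(axis=0).mean()  # roughly var_single / 16
```

The empirical variance of the batched estimator shrinks by approximately the batch size, as expected for i.i.d. samples.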
The result in Sect. 3.1 shows asymptotic convergence of the proximal gradient method with constant step sizes and increasing sampling. In Sect. 3.2, we switch to the versatile ordinary differential equation (ODE) method to prove convergence of the stochastic proximal gradient method with decreasing step sizes. We emphasize that the convergence results generalize existing convergence theory from the finite-dimensional case. Our analysis includes convergence in possibly infinite-dimensional Hilbert spaces. Additionally, we allow for stochastic gradients subject to additive bias, which is not covered by existing results. This theory can be used to develop mesh refinement strategies in applications with PDEs [20].

Variance-reduced stochastic proximal gradient method
In this section, we show under what conditions the variance-reduced stochastic proximal gradient method converges to stationary points of Problem (P). With ξ_n = (ξ_n^1, …, ξ_n^{m_n}), the stochastic gradient is given by the average G(u_n, ξ_n) = (1/m_n)∑_{i=1}^{m_n} G(u_n, ξ_n^i) over an increasing number of samples m_n. The algorithm is presented below; it uses constant step sizes t_n ≡ t depending on the Lipschitz constant L from Assumption 3.1(ii).
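A minimal sketch of this scheme on an invented toy problem (quadratic expectation plus an L¹ term; the soft-thresholding prox and all constants are illustrative choices, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(2)

def prox_l1(v, s):
    # Proximity operator of s*||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

u_star = np.array([1.0, -2.0, 0.0])   # j(u) = 0.5*E||u - xi||^2 with E[xi] = u_star
lam = 0.05                            # weight of the nonsmooth term h = lam*||.||_1
L_const = 1.0                         # Lipschitz constant of grad j
t = 1.0 / (3.0 * L_const)             # constant step size tied to L
u = np.zeros(3)
for n in range(1, 301):
    m_n = n                                    # increasing batch size
    xis = u_star + rng.normal(size=(m_n, 3))
    g = np.mean(u - xis, axis=0)               # averaged stochastic gradient
    u = prox_l1(u - t * g, t * lam)
```

The iterates approach the stationary point of the toy problem, which here is the soft-thresholded mean, while the growing batch dampens the gradient noise.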

Remark 3.3
If h(u) = δ_C(u) and π_C denotes the projection onto C, then the algorithm reduces to u_{n+1} := π_C(u_n − t G(u_n, ξ_n)), i.e., the variance-reduced projected stochastic gradient method.
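For instance, with the (hypothetical) box C = [−0.5, 0.5]^d, the proximity operator of the indicator is a componentwise clip:

```python
import numpy as np

def proj_box(v, lo=-0.5, hi=0.5):
    # prox of the indicator of C = [lo, hi]^d equals the Euclidean projection onto C.
    return np.clip(v, lo, hi)

p = proj_box(np.array([1.0, -2.0, 0.1]))  # components outside the box are clipped
```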
In addition to Assumption 3.1, the following assumptions will be in force in this section.

Remark 3.5 We use assumptions similar to those found in [35], but we do not require the effective domain of h to be bounded; we instead use boundedness of the iterates from Assumption 3.1(i). Notice that w_n = r_n + e_n with e_n from Assumption 3.1(iv); hence Assumption 3.4(ii) also provides a condition on the rate at which r_n and e_n must decay.
For the convergence result, we need the following lemma [46].

Lemma 3.6 (Robbins–Siegmund) Assume that {F_n} is a filtration and v_n, a_n, b_n, c_n are nonnegative random variables adapted to F_n. If

𝔼[v_{n+1}|F_n] ≤ v_n(1 + a_n) + b_n − c_n a.s. for all n,

and ∑_{n=1}^∞ a_n < ∞ and ∑_{n=1}^∞ b_n < ∞ a.s., then with probability one, {v_n} is convergent and ∑_{n=1}^∞ c_n < ∞.
To show convergence, we first present a technical lemma.

Now, by (3.2), this is equivalent to the display above. Finally, adding (3.7) and (3.8), and using that f = j + h, we get (3.1). ◻ In the following, we define ū_{n+1} := prox_{th}(u_n − t∇j(u_n)) as the iterate at n + 1 if the true gradient were used.

Lemma 3.8 For all n, inequality (3.10) holds.
Proof Using Lemma 3.7 with v = ū_{n+1}, u = z = u_n, and g = ∇j(u_n), we have the first estimate. Again using Lemma 3.7, with v = u_{n+1}, z = ū_{n+1}, u = u_n, and g = ∇j(u_n) + w_n, we get (3.13).

Taking conditional expectation on both sides of (3.13), and noting that ū_{n+1} is F_n-measurable by F_n-measurability of u_n, we get (3.10). ◻

Remark 3.9
Any bounded sequence {u_n} in H contains a weakly convergent subsequence {u_{n_k}} such that u_{n_k} ⇀ u for some u ∈ H. Generally this convergence is not strong, so we cannot conclude from ‖ū_{n+1} − u_n‖² → 0 that there exists a ũ such that, for a subsequence {u_{n_k}}, lim_{k→∞} ū_{n_k+1} = lim_{k→∞} u_{n_k} = ũ. Therefore, to obtain convergence to stationary points, we will assume that {u_n} has a strongly convergent subsequence.
We are ready to state the convergence result for sequences generated by Algorithm 1. Proof The sequence {u_n} is contained in a bounded set V by Assumption 3.1(i). By Assumption 3.4(i), h ∈ Γ₀(H) must therefore be bounded below on V [4, Corollary 9.20]; j is bounded below by Assumption 3.1(ii). W.l.o.g. we can thus assume f ≥ 0. Since 1/(2t) > L and ∑_{n=1}^∞ 𝔼[‖w_n‖²|F_n] < ∞ by Assumption 3.4(ii), we can apply Lemma 3.6 to (3.10) to conclude that f(u_n) converges almost surely. The second statement follows immediately, since by Lemma 3.6, ∑_{n=1}^∞ ‖ū_{n+1} − u_n‖² < ∞ a.s., which implies that for almost every sample path, lim_{n→∞} ‖ū_{n+1} − u_n‖² = 0.
For the third statement, we have that there exists a subsequence {u_{n_k}} such that u_{n_k} → u. We argue that then ū_{n_k+1} → u. Since {ū_{n_k+1}} is bounded, there exists a weak limit point ũ (potentially on a subsequence with the same labeling). Then, using weak lower semicontinuity of the norm as well as the rule ⟨a_n, b_n⟩ → ⟨a, b⟩ for a_n ⇀ a and b_n → b, we obtain u = ũ. It follows that ū_{n_k+1} → u by assuming lim_{k→∞} ‖ū_{n_k+1}‖² ≠ ‖u‖² and arriving at a contradiction. Now, by definition of the prox operator, ū_{n_k+1} = prox_{th}(u_{n_k} − t∇j(u_{n_k})). By optimality of ū_{n_k+1} (see Fermat's rule, [4, Theorem 16.2]), 0 ∈ ∂h(ū_{n_k+1}) + ∇j(u_{n_k}) + (1/t)(ū_{n_k+1} − u_{n_k}). Taking the limit as k → ∞, and using continuity of ∇j, we conclude by strong-to-weak sequential closedness of gra(∂h) that 0 ∈ ∇j(u) + ∂h(u), so u is a stationary point. ◻

Stochastic proximal gradient method: decreasing step sizes
An obvious drawback of Algorithm 1 is the fact that step sizes are restricted to small steps bounded by a factor depending on the Lipschitz constant, which in applications might be difficult to determine. Additionally, the algorithm requires increasing batch sizes to dampen noise, which is unattractive from a complexity standpoint. In this section, we obtain convergence with a nonsmooth and convex term h using the step size rule

(3.16)  t_n > 0,  ∑_{n=1}^∞ t_n = ∞,  ∑_{n=1}^∞ t_n² < ∞.

This step size rule dampens noise enough so that increased sampling is not necessary. We consider Problem (P) with h = φ + δ_C, where φ ∈ Γ₀(H) and C ⊂ H is nonempty, closed, and convex. For asymptotic arguments, it will be convenient to treat the term δ_C separately. To that end, we define ψ := j + φ and note that f(u) = ψ(u) + δ_C(u). The stochastic gradient G : H × Ξ → H can be comprised of one or more samples as in the unconstrained case; see Remark 3.2.
The algorithm is now stated below.
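As a complement to the formal statement, a toy sketch of the iteration with decreasing step sizes and a single sample per step; the quadratic objective, the noise model, the L¹ weight, and the box bounds are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def prox_h(v, t, lam=0.05, lo=-0.5, hi=0.5):
    # prox of t*(lam*||.||_1 + indicator of the box [lo, hi]^d):
    # soft-threshold first, then clip to the box.
    return np.clip(np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0), lo, hi)

u_star = np.array([0.3, -0.9, 0.0])
u = np.zeros(3)
for n in range(1, 20001):
    t_n = 1.0 / n                 # decreasing steps: sum t_n = inf, sum t_n^2 < inf
    xi = u_star + rng.normal(size=3)
    g = u - xi                    # single-sample stochastic gradient of the toy quadratic
    u = prox_h(u - t_n * g, t_n)
```

No increasing batches and no Lipschitz constant are needed; the summable squared step sizes dampen the single-sample noise.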
To prove convergence of Algorithm 2, we will use the ODE method, which dates back to [33,36]. While we use many ideas from [13], we emphasize that we generalize results to (possibly infinite-dimensional) Hilbert spaces and, moreover, we handle the case where j is an expectation.
We define the set-valued map S : C ⇉ H by S(u) := −∇j(u) − ∂φ(u) − N_C(u). Additionally, we define the sequence of (single-valued) maps S_n : C → H for all n by S_n(u) := 𝔼[S_n(u, ξ)], where S_n(u, ξ) := −∇j(u) − (1/t_n)(u − t_n G(u, ξ) − prox_{t_n h}(u − t_n G(u, ξ))). In addition to Assumption 3.1, the following assumptions will apply in this section.

Remark 3.12
To handle the infinite-dimensional case, we use assumptions that are generally more restrictive than in [13]; we restrict ourselves to the case where C and φ are convex, and we assume higher regularity of j in Assumption 3.1(ii) to handle the case j(u) = 𝔼[J(u, ξ)]. However, we allow for bias r_n, which is not covered in [13]. We note that C does not need to be bounded if φ is Lipschitz continuous over C. Assumption 3.11(ii) is satisfied if dom(φ) = H and ∂φ maps bounded sets to bounded sets; see also [4, Proposition 16.17] for equivalent conditions. The last assumption is technical but standard; see [48, Assumption H4].
The main result is the following, which we will prove in several parts. Throughout, we use the notation g_n := G(u_n, ξ_n).
Proof Note that u_n and r_n are F_n-measurable, so 𝔼[g_n|F_n] = ∇j(u_n) + r_n, where we used that ξ_n is independent from ξ₁, …, ξ_{n−1}. By the definitions of y_n and w_n, we arrive at the conclusion. ◻

Lemma 3.15 For any
Proof By definition of the proximity operator, or equivalently (note ū, u ∈ C), (3.21)

followed by Lemma 3.15 and Assumption 3.11(iii), we get the desired estimate. Let v_n := ∑_{j=1}^n t_j w_j. We show that {v_n} is a square integrable martingale, i.e., v_n ∈ L²(Ω, H) for every n and sup_n 𝔼[‖v_n‖²] < ∞. It is clearly a martingale, since for all n, 𝔼[w_n|F_n] = 0 and thus 𝔼[v_{n+1}|F_n] = v_n. To show that {v_n} is square integrable, we use (3.23) and the fact that 𝔼[v_n] = 0 for all n to conclude that its quadratic variations are bounded. Indeed, because of condition (3.16), we have that sup_n 𝔼[A_n] < ∞. We have obtained that {v_n} is square integrable, so by Lemma 3.17, it follows that {v_n} converges a.s. to a limit as n → ∞. ◻

Lemma 3.19
The following is true with probability one: lim_{n→∞} ‖u_{n+1} − u_n‖ = 0. Proof This is a simple consequence of (3.18) and a.s. boundedness of y_n, r_n, and w_n for all n by Lemmas 3.16, 3.18, and Assumption 3.1(iii), respectively. ◻ Lemma 3.20 For any sequence {z_n} in C such that z_n → z as n → ∞, it follows that (3.25) holds. Proof Notice that C is closed, so z ∈ C. The fact that S(z) is nonempty, closed, and convex follows from these properties of ∇j(z), ∂φ(z), and N_C(z). We define g_n := G(z_n, ξ) and

(3.26)  S_n(z_n, ξ) := −∇j(z_n) − (1/t_n)(z_n − t_n g_n − prox_{t_n h}(z_n − t_n g_n)).

Clearly, 𝔼[S_n(z_n, ξ)] = S_n(z_n). Now, by Jensen's inequality and convexity of the mapping u ↦ d(u, S(z)), we can estimate d(S_n(z_n), S(z)) ≤ 𝔼[d(S_n(z_n, ξ), S(z))].
Notice that z = prox_{th}(u) if and only if 0 ∈ ∂φ(z) + N_C(z) + (1/t)(z − u), so with

(3.27)  z̄_n := prox_{t_n h}(z_n − t_n g_n),

there exist ν_{φ,n} ∈ ∂φ(z̄_n) and ν_{C,n} ∈ N_C(z̄_n) such that

(3.28)  −(ν_{φ,n} + ν_{C,n}) = (1/t_n)(z̄_n − z_n + t_n g_n).

By strong-to-weak sequential closedness of gra(∂φ) and gra(N_C) as well as continuity of ∇j, it follows that (3.29) holds. We show that d(S_n(z_n, ξ), S(z)) is almost surely bounded by an integrable function M(z) for all n. Using elementary arguments and (3.29) in the third inequality, we obtain a bound
which is almost surely bounded by Assumption 3.11(ii) and Assumption 3.11(iv). By the dominated convergence theorem, it follows from (3.30) that 𝔼[d(S_n(z_n, ξ), S(z))] → 0 as n → ∞. Finally, (3.25) follows from the fact that if a_n → 0 as n → ∞, then (1/m)∑_{n=1}^m a_n → 0 as m → ∞. ◻ Now we will show a compactness result, adapted from [15], namely that in the limit, the time shifts of the linear interpolation of the sequence {u_n} can be made arbitrarily close to trajectories, or solutions, of the differential inclusion (3.31); this is the content of Theorem 3.22. Proof Relative compactness of time shifts. We first claim that the set of time shifts is relatively compact in C([0, T], H) for all T > 0. We consider a fixed (but arbitrary) sample path ω = (ξ₁, ξ₂, …) throughout the proof. Let p := min{n : s_n ≥ τ} and q := max{n : s_n ≤ t}. By (3.33),

We take the limit p, q → ∞ on the right-hand side of (3.37) and observe that by Lemma 3.16, lim_{n→∞} sup_{m≥n} t_m‖y_m‖ = 0, and by Lemma 3.18, the partial sums ∑_j t_j w_j converge a.s. By Lemma 3.19 as well as convergence of the other terms on the right-hand side of (3.40), for ε > 0 there exists an N such that ‖u_k(t) − u_j(t)‖ ≤ ε for all k, j > N, and thus {u_n(t)} has a Cauchy subsequence as n → ∞. Now we consider the case where the sequence {τ_n} is bounded. Then τ_n → τ̄ for some τ̄ > 0, at least on a subsequence (with the same labeling). By convergence of {τ_n} we get that m_j = n_k for k, j ≥ N and N large enough. Therefore (3.38) reduces to a simpler expression. We can bound the terms on the right-hand side of (3.41) as before to obtain that {u_n(t)} has a Cauchy subsequence. We have shown that A(t) is relatively compact for all t ∈ [0, T], T > 0, so by the Arzelà–Ascoli theorem, it follows that the set A is relatively compact.
Now, we will show that ȳ(t) ∈ S(ū(t)) for a.e. t ∈ [0, T]. By the Banach–Saks theorem (cf. [42]), there exists a subsequence of {y(⋅ + τ_{n_k})} (where we use the same notation for the sequence as for its subsequence) whose averages converge strongly. Recall that y_n = S_n(u_n) by Lemma 3.14 and set t_k := max{ℓ : s_ℓ ≤ t + τ_{n_k}}. Then we have an error term which a.s. converges to zero as k → ∞, since u(⋅ + τ_{n_k}) → ū(⋅) and t_n → 0 by (3.16) (combined with a.s. boundedness of y_n, r_n, and w_n for all n by Lemma 3.16, Assumption 3.1(iii), and Lemma 3.18, respectively). Now, using y(t + τ_{n_k}) = y_{t_k}, we get an expression which converges to zero as m → ∞ by (3.43) and Lemma 3.20, where we note that u(s_{t_k}) → ū(t) as k → ∞ by (3.44). Since S(ū(t)) is a closed set and the sample path was chosen arbitrarily, the statement must be true with probability one. ◻ Now, we show that there is always a strict decrease in ψ along a trajectory that originates at a noncritical point z(0).

Lemma 3.23 If z(⋅) is a trajectory of the differential inclusion (3.31) and 0 ∉ S(z(0)), then there exists T > 0 such that ψ(z(T)) ≤ ψ(z(0)) − ∫_0^T d(0, S(z(s)))² ds < ψ(z(0)).
The following proof is standard, but we need to make several arguments differently in the infinite-dimensional setting. We will proceed as in [13]. We define the level sets of ψ as L_ε := {u ∈ C : ψ(u) ≤ ε}. Proposition 3.24 For all ε > 0 there exists an N such that for all n ≥ N, if u_n ∈ L_ε, then u_{n+1} ∈ L_{2ε} a.s.
Proof First, we remark that ψ is uniformly continuous on V, since φ(⋅) satisfies (3.17) and is, in turn, Lipschitz continuous on V, and since j is Lipschitz continuous on V as well. Therefore, for any ε > 0 there exists a δ > 0 such that if ‖u_{n+1} − u_n‖ < δ, then |ψ(u_{n+1}) − ψ(u_n)| < ε. Now, we choose N such that ‖u_{n+1} − u_n‖ < δ for all n ≥ N, which is possible by Lemma 3.19. Then it must follow that |ψ(u_{n+1}) − ψ(u_n)| < ε for all n ≥ N as well. Now, since u_n ∈ L_ε, it follows that ψ(u_{n+1}) ≤ 2ε, so u_{n+1} ∈ L_{2ε}. ◻

Lemma 3.25
The following equalities hold.
Proof We argue that lim inf_{n→∞} ψ(u_n) ≤ lim inf_{t→∞} ψ(u(t)); the other direction is clear by construction of u(⋅) from (3.32). Let {τ_n} be a sequence such that τ_n → ∞, lim_{n→∞} u(τ_n) = ū for some ū ∈ H, and lim inf_{n→∞} ψ(u(τ_n)) = ψ(ū). With k_n := max{k : s_k ≤ τ_n}, we get a difference which converges to zero as n → ∞ by (3.24) and convergence of the sequence {u(τ_n)}. Therefore u_{k_n} → ū and so, by continuity of ψ, it follows that lim_{n→∞} ψ(u_{k_n}) = ψ(ū). Analogous arguments can be made for the claim lim sup_{n→∞} ψ(u_n) ≥ lim sup_{t→∞} ψ(u(t)). ◻ Lemma 3.26 Only finitely many iterates {u_n} are contained in H∖L_{2ε}.

Proof We define the indices

i₁ := min{n : u_n ∈ L_ε and u_{n+1} ∈ L_{2ε}∖L_ε},
e₁ := min{n : n > i₁ and u_n ∈ H∖L_{2ε}},
i₂ := min{n : n > e₁ and u_n ∈ L_ε},

and so on. We argue by contradiction and recall that s_n = ∑_{j=1}^{n−1} t_j. Suppose infinitely many {u_n} are in H∖L_{2ε}; then it must follow that i_j → ∞ as j → ∞. By Theorem 3.22, {u(⋅ + s_{i_j})} is relatively compact in C([0, T], H) for all T > 0 and there exists a subsequence (with the same labeling) and a limit point z(⋅) such that z(⋅) is a trajectory of (3.31). Now, since by construction ψ(u_{i_j}) ≤ ε and ψ(u_{i_j+1}) > ε, it follows that (3.51) holds. Recall that lim_{j→∞} u_{i_j} = lim_{j→∞} u(s_{i_j}) = z(0). Taking the limit j → ∞ on both sides of (3.51), by continuity of ψ, we get ψ(z(0)) = ε, meaning z(0) is not a critical point of ψ. Thus we can invoke Lemma 3.23 to get the existence of a T > 0 such that (3.52) holds. By uniform convergence of u(⋅ + s_{i_j}) to z(⋅), it follows for j sufficiently large that (3.53) must hold. We now find a contradiction to the statement (3.53). This is done by observing the sequence ℓ_j := max{ℓ : s_{i_j} ≤ s_ℓ ≤ s_{i_j} + T}. From (3.52), we have that there exists a δ > 0 such that ψ(z(T)) ≤ ε − 2δ. Observe that u_{ℓ_j} → u(T + s_{i_j}) and hence u_{ℓ_j} → z(T) as j → ∞. By continuity, we get lim_{j→∞} ψ(u_{ℓ_j}) = ψ(z(T)). Thus ψ(u_{ℓ_j}) < ε − δ for j sufficiently large, a contradiction to (3.53). ◻
Proof W.l.o.g. assume lim inf_{t→∞} ψ(u(t)) = 0; this is possible by the fact that j and φ are bounded below. Choosing ε > 0 such that ε ∉ ψ(S⁻¹(0)), we have by Lemma 3.26 that for N sufficiently large, u_n ∈ L_{2ε} for all n ≥ N. Since ε can be chosen arbitrarily small, we conclude that lim_{t→∞} ψ(u(t)) = 0. ◻

Proof of Theorem 3.13
The fact that {ψ(u_n)} converges follows from Proposition 3.27 and Lemma 3.25. Since {u_n} ⊂ C, it trivially follows that {f(u_n)} converges a.s. Let ū be a limit point of {u_n} and suppose that 0 ∉ S(ū). Let {u_{n_k}} be a subsequence converging to ū and let z(⋅) be the limit of {u(⋅ + s_{n_k})}. Then, by Lemma 3.23, there exists a T > 0 such that (3.54) holds. However, it follows from Proposition 3.27 that {ψ(u(t))} converges, which is a contradiction to (3.54). ◻

Application to PDE-constrained optimization under uncertainty
In this section, we apply the algorithm presented in Sect. 3.2 to a nonconvex problem from PDE-constrained optimization under uncertainty. In Sect. 4.1, we set up the problem and verify conditions for convergence of the stochastic proximal gradient method. We show numerical experiments in Sect. 4.2.

Model problem
We first introduce notation and concepts specific to our application; see [18,52].
Let D ⊂ ℝ^d be an open and bounded Lipschitz domain. The inner product between vectors x, y ∈ ℝ^d is denoted by x ⋅ y. We will focus on a semilinear diffusion–reaction equation with uncertainties, which describes transport phenomena at equilibrium and is motivated by [41]. We assume that there exist random fields a : D × Ω → ℝ and r : D × Ω → ℝ, which are the diffusion and reaction coefficients, respectively. To facilitate simulation, we will make a standard finite-dimensional noise assumption, meaning each random field has the form a(x, ω) = a(x, ξ(ω)), where ξ(ω) = (ξ₁(ω), …, ξ_m(ω)) is a vector of real-valued uncorrelated random variables ξ_i : Ω → Ξ_i ⊂ ℝ. The support of the random vector will be denoted by Ξ := ∏_{i=1}^m Ξ_i. We consider the following PDE constraint, to be satisfied for almost every ξ ∈ Ξ. Optimal control problems with semilinear PDEs involving random coefficients have been studied in, for instance, [26,27]. We include a nonsmooth term as in [14] with the goal of obtaining sparse solutions. In the following, we assume that the cost parameters satisfy λ₁ ≥ 0 and λ₂ ≥ 0 and that y_D ∈ L²(D). The model problem we solve is given by Problem (P'). The following assumptions will apply in this section. In particular, we do not require uniform bounds on the coefficient a(⋅, ξ), which allows for modeling with log-normal random fields.
Existence of a solution to Problem (P') follows by applying [27, Proposition 3.1]. The following result holds by [27, Proposition 2.1] combined with standard a priori estimates for a fixed realization to obtain (4.2) and (4.3).
A uniform mesh T with 9800 shape-regular triangles T was used. We denote the mesh fineness by ĥ = max_{T∈T} diam(T). The state and adjoint were discretized using piecewise linear finite elements (where P_i denotes the space of polynomials of degree up to i). For the controls, we choose a discretization of L²(D) by piecewise constants, given by the sets Û_ĥ := {u ∈ L²(D) : u|_T ∈ P₀(T) for all T ∈ T} and Ĉ_ĥ := Û_ĥ ∩ C. The projection P_ĥ onto Û_ĥ is used to project the stochastic gradient as in [20]. Hence, the last line of Algorithm 2 is given by the expression u_{n+1} := prox_{t_n h}(u_n − t_n P_ĥ G(u_n, ξ_n)). For the computation of the proximity operator

prox_{t(φ+δ_C)}(z) = argmin_{−0.5≤v≤0.5} { λ₁‖v‖_{L¹(D)} + (1/2t)‖v − z‖²_{L²(D)} },

we use the formula from [5, Example 6.22], defined piecewise on each element of the mesh. For each T ∈ T, it is given by soft-thresholding the cell value by tλ₁ followed by clipping to [−0.5, 0.5]. For convergence plots, we use a heuristic to approximate the objective function and the measure of stationarity by increasing sampling as the control reaches stationarity.
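Concretely, this cellwise prox formula can be sketched as follows (the function name and vectorized form are ours; the input is the vector of piecewise-constant control values on the mesh):

```python
import numpy as np

def prox_l1_box(z, t, lam1, lo=-0.5, hi=0.5):
    # prox of t*(lam1*||.||_{L^1} + indicator of {lo <= v <= hi}), applied
    # cellwise to piecewise-constant control values on a uniform mesh:
    soft = np.sign(z) * np.maximum(np.abs(z) - t * lam1, 0.0)  # soft-threshold by t*lam1
    return np.clip(soft, lo, hi)                               # then project onto the box

p = prox_l1_box(np.array([0.8, -0.1, -2.0]), t=1.0, lam1=0.2)
```

Note that on a uniform mesh the cell area cancels between the L¹ and L² terms, so the scalar formula applies unchanged to each cell value.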
To be more precise, we use a sequence of sample sizes {m_n} with m_n = 10⌊n/50⌋ + 1 newly generated i.i.d. samples (ξ_{n,1}, …, ξ_{n,m_n}) and compute sample-average estimates of the objective function and the stationarity measure. The algorithm is terminated for n ≥ 50 if r̄_n := ∑_{k=n−50}^n r_k ≤ tol with tol = 2 × 10⁻⁴. The parameters for our heuristic termination rule were tuned, for illustration purposes only, so that the algorithm stopped after several hundred iterations. A plot of the control after termination is shown in Fig. 1. The effect of the sparse term as well as the constraint set C can be seen clearly. Decay of the objective function value
P̂hG(u n , n,j ) . and the stationarity measure are shown in Fig. 2. We see convergence of the objective function values and the stationarity measure tends to zero as expected. Additionally, we conduct an experiment to demonstrate mesh independence of the algorithm by running the algorithm once each for different meshes and comparing the number of iterations needed until the tolerance tol is reached. In Table 1, we see that these iteration numbers are of the same order. The estimate for the objective function f N is also included at the final iteration N, demonstrating how solutions become more exact on finer meshes.
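The sampling schedule and the windowed stopping rule above can be sketched as follows; the stationarity values r_n are stand-ins here (in the experiments they are estimated with the m_n fresh samples):

```python
def sample_size(n):
    """m_n = 10 * floor(n / 50) + 1 newly generated i.i.d. samples."""
    return 10 * (n // 50) + 1

def should_terminate(r_history, n, tol=2e-4, window=50):
    """Stop for n >= window once the summed stationarity measure over the
    last window+1 iterates falls below tol."""
    if n < window:
        return False
    return sum(r_history[n - window:n + 1]) <= tol

# The schedule grows in steps: 1 sample for n < 50, 11 for 50 <= n < 100, ...
assert sample_size(0) == 1
assert sample_size(50) == 11
```

The growing m_n makes the late-iteration estimates of f_n and r_n increasingly accurate, so the stopping test is reliable precisely when it matters, without paying for large sample sizes early on.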

Conclusion
In this paper, we presented an asymptotic convergence analysis for two variants of the stochastic proximal gradient algorithm in Hilbert spaces. The main results address asymptotic convergence to stationary points for general functions defined on a Hilbert space. Moreover, we presented an application of the theory to a problem from PDE-constrained optimization under uncertainty. The assumptions for convergence were verified for a tracking-type problem with an L^1-penalty term subject to a semilinear elliptic PDE with random coefficients and box constraints. Numerical experiments demonstrated the effectiveness of the method.
The ODE method from Sect. 3.2 allowed us to prove a more general result under weaker assumptions on the objective function. However, we needed to introduce an assumption on the set of critical values in the form of Assumption 3.11(v). While we did not verify this assumption for our model problem, it would be interesting to know whether it is verifiable for this class of problems. We also had to be slightly more restrictive on the nonsmooth term in Sect. 3.2 than we were in Sect. 3.1. The advantages of Algorithm 2 over Algorithm 1 in terms of computational cost are clear: the use of decreasing step sizes in Algorithm 2 means that increased sampling is not needed. Additionally, there is no need to determine the Lipschitz constant of the gradient, which in the application depends on (among other things) the Poincaré constant and the lower bound on the random fields, and would thus lead to a prohibitively small constant step size. This phenomenon has been demonstrated in [20].
How to scale the decreasing step size t_n remains an open question. In practice, the scaling of the step size can be tuned offline. An improper choice of the scaling c in the step size t_n = c/n^α for 0.5 < α ≤ 1 can lead to arbitrarily slow convergence; this was demonstrated in [39]. While this was not the focus of our work, efficiency estimates for nonconvex problems might also be possible following the work in [7,21,35]. In lieu of efficiency estimates, it would be desirable to have better termination conditions that do not rely on increased sampling, as our heuristic did in the numerical experiments. Finally, it would be natural to investigate mesh refinement strategies as in [20]. For more involved choices of the nonsmooth term, the prox computation is itself subject to numerical error and should be accounted for in the analysis.
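The sensitivity to the scaling c can be seen already on a toy problem. The following sketch (ours, not from [39]) runs gradient steps with t_n = c/n^α on the smooth one-dimensional problem min_u ½(u − 1)², started at u = 0; with α = 1, the error after n iterations behaves like n^{−c}, so a small c stalls the iteration:

```python
def run(c, alpha=1.0, iters=10_000):
    """Gradient iteration u_{n+1} = u_n - (c / n**alpha) * (u_n - 1)."""
    u = 0.0
    for n in range(1, iters + 1):
        grad = u - 1.0                 # gradient of 0.5 * (u - 1)**2
        u -= (c / n ** alpha) * grad   # decreasing step size t_n = c / n**alpha
    return u

# c = 1.0 reaches the minimizer u* = 1 quickly, while c = 0.05 leaves an
# error of order 10000**(-0.05) ~ 0.6 even after 10^4 iterations.
err_good = abs(run(1.0) - 1.0)
err_bad = abs(run(0.05) - 1.0)
```

Since the optimal c depends on unknown problem constants (here, the curvature of the objective), offline tuning or adaptive schemes are needed in practice.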
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Auxiliary results
To prove Lemma 3.17, we first need the following result.
, and therefore the boundedness of {v_n} for all n follows from (3.22), and vice versa. Supposing now that (3.22) holds, the fact that {v_n} converges to a limit v_∞ follows by Proposition A.1. ◻

C Differentiability of expectation functionals
Let (X, ‖ ⋅ ‖_X) be a Banach space and let J : X × Ω → ℝ be a random functional.
We summarize under what conditions we can exchange the integral and the derivative for the functional j : X → ℝ, where j(u) = ∫_Ω J(u, ω) dℙ(ω).
The following definition gives the minimal requirement for exchanging the derivative and the expectation, namely, requiring J : X × Ω → ℝ to be L^1-Fréchet differentiable.
Definition C.1 A p-times integrable random functional J : X × Ω → ℝ is called L^p-Fréchet differentiable at u if, for an open set U ⊂ X containing u, there exists a bounded and linear random operator A : U × Ω → ℝ such that lim_{h→0} ‖J(u + h, ⋅) − J(u, ⋅) − A(u, ⋅)h‖_{L^p(Ω)} / ‖h‖_X = 0.
By Hölder's inequality, if u ↦ J(u, ⋅) is L p -differentiable and 1 ≤ r < p , then it is also L r -differentiable with the same derivative. This implies that j ∶ X → ℝ is Fréchet differentiable at u.
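The Hölder step behind this claim is short; since ℙ is a probability measure, for 1 ≤ r < p and f ∈ L^p(Ω) one has

```latex
\|f\|_{L^r(\Omega)}^r
  = \int_\Omega |f|^r \cdot 1 \, d\mathbb{P}
  \le \left( \int_\Omega |f|^p \, d\mathbb{P} \right)^{r/p}
      \left( \int_\Omega 1 \, d\mathbb{P} \right)^{1 - r/p}
  = \|f\|_{L^p(\Omega)}^r ,
```

so ‖f‖_{L^r(Ω)} ≤ ‖f‖_{L^p(Ω)}. Applying this bound to the difference quotient in Definition C.1 shows that the defining limit for L^p-differentiability forces the same limit in L^r, with the same operator A.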
Proof By the mean value theorem, for h small enough, there exists a z in the neighborhood containing u + h and u that satisfies, for almost every ω ∈ Ω, |J(u + h, ω) − J(u, ω)| ≤ ‖J′(z, ω)‖_{X*}‖h‖_X. By Assumption C.2(vii), C(⋅) is integrable, so by Lebesgue's dominated convergence theorem the limit may be passed inside the integral, where the last equality follows by Assumption C.2(vii). Now consider the mapping F : h ↦ ∫_Ω J′(u, ω)h dℙ(ω). It is straightforward to show that this is a bounded and linear operator. Therefore, we use Assumption C.2(vi), the triangle inequality, and (C.3) to conclude that j is Fréchet differentiable at u with derivative F = ∫_Ω J′(u, ω) dℙ(ω). ◻