1 Introduction and problem statement

In the following, we denote the set of Lebesgue integrable functions mapping from the space X to the space Y by \(L^{1}(X;Y)\) and from the space X to the real numbers \(\mathbb {R}\) by \(L^{1}(X)\). The “dot” notation is used in the following way: g(⋅,y) denotes the mapping \(x \mapsto g(x,y)\).

Our goal in this article is to develop a novel stochastic gradient method for the efficient solution of optimization problems of the general form:

$$ \underset{u \in {U_{\text{ad}}}}{\min} F(u) = J(f(u,\cdot)). $$
(1.1)

Here, u is the design variable, which can be subject to constraints described by the set Uad, and \(F:{U_{\text {ad}}} \mapsto \mathbb {R}\) is given as a composition of a functional \(J : L^{1}(V_{\text {ad}}) \mapsto \mathbb {R} \) and a function \(f : {U_{\text {ad}}} \times V_{\text {ad}} \mapsto \mathbb {R}\), where Vad is a continuous parameter set. Throughout this paper, we further assume that the evaluation of the function f for any (u,v) ∈ Uad × Vad requires the solution of an underlying state problem, i.e., f is given in the following form:

$$ f(u,v) = j(u, y(u;v)), $$

where y(u;v) denotes the solution of the state problem parameterized by the design u and the additional continuous index variable v ∈ Vad. As a consequence of this construction, an evaluation of the function F at a given design u theoretically requires the solution of infinitely many state problems.

In order to demonstrate that problem (1.1) has a broad range of applications, we give two particular examples for the choice of the functional J. In our first example, J is simply an integral over Vad, resulting in the problem as follows:

$$ \underset{u \in U_{\text{ad}}}{\min} {\int}_{V_{\text{ad}}} f(u,v) \mathrm{d}v. $$
(1.2)

The structure in (1.2) can arise in various settings. First, in an elastic setting, v could be a continuous load index and f(⋅,v) a compliance, displacement, or stress evaluation associated with the solution of the state problem with load index v ∈ Vad. In this case, (1.2) would be a structural optimization problem with infinitely many load cases. For optimization problems with many (but finitely many) load cases, we refer, e.g., to Alvarez and Carrasco (2005) and Zhang et al. (2017), as well as the references therein. Furthermore, if, for applications in acoustics (see, e.g., Dilgen et al. 2019) or optics (see, e.g., Jensen and Sigmund 2011), a state problem in time-harmonic form is considered, the parameter v can play the role of a frequency or wavelength. A prominent example for f in this context would be an L2-tracking function, such that (1.2) would describe the design of a device with a prescribed behavior over a continuous frequency range, such as the range of visible light; see Semmler et al. (2015). Another application in optics could be the optimization of an anisotropic object with respect to arbitrary illumination directions, again see Semmler et al. (2015). While we have so far looked at all these examples from a deterministic point of view, another important class of applications arises if we interpret F(u) as the expected value of a given property f associated with a design u. This immediately leads to the notion of reliability-based optimization problems (RBO); see, e.g., the overview article by Maute and Frangopol (2003), as well as Conti et al. (2009) or De et al. (2019) and the references therein for a collection of more recent articles on this topic. In all these cases, the parameter v constitutes a realization of uncertainty (from a continuous uncertainty set Vad), where the source of uncertainty can lie, e.g., in the loading, the material properties, or the stiffness.

Following this line of argumentation a little further, we come to the second instantiation of the generic problem (1.1), which is referred to as a robust structural optimization problem according to De et al. (2019) and takes the following form:

$$ \underset{u \in {U_{\text{ad}}}}{\min} \mathbb{E}[f(u,\cdot)] + \lambda \text{Var}(f(u,\cdot)). $$
(1.3)

Here, the expected value \(\mathbb {E}[f(u,\cdot )]\) is computed by the integral in (1.2), and λ is a positive parameter that weights the importance of variations.

Having said this, we would like to emphasize that our paper is not the first one suggesting the application of stochastic-gradient-type methods for the solution of structural optimization problems of the aforementioned type. Zhang et al. (2017) use the classical stochastic gradient (SG) method for the efficient solution of structural optimization problems with many (but finitely many) load cases. In De et al. (2019), robust optimization problems of the form (1.3) are solved by the SG method and variants, as well as by a stochastic version of the well-known GCMMA framework, see Svanberg (2002). However, in that work, a discrete set of scenarios is also assumed from the very beginning. More applications of the SG method can be found in the closely related area of inverse problems, see Haber et al. (2012) and Roosta-Khorasani et al. (2014). These are structurally similar to the problems considered in this article in the sense that each evaluation of the function f for a given scenario v also requires the potentially expensive solution of a discretized partial differential equation (PDE).

However, there are at least two substantial differences between all these references and the approach we suggest in this paper. Firstly, even though in some cases in the aforementioned articles a continuous index set Vad is considered, an a priori selection of scenarios (or discretization of Vad) is used. Secondly, even though in many applications the property function f depends on the index variable v with a certain regularity, this structure is ignored. In sharp contrast to this, we avoid an a priori discretization of integrals in this paper. This is principally due to the fact that a too coarse discretization can lead to artificial minima as will be demonstrated by means of a simple example in Section 2. Moreover, we would like to exploit the natural regularity of the property function f with respect to the index parameter v in order to design an efficient optimization algorithm in which the objective function F and its gradient are increasingly better approximated.

Beyond stochastic descent methods, robust problems of type (1.3) can also be approached by stochastic collocation methods combined with deterministic optimization solvers, see, e.g., Lazarov et al. (2012). However, also in this case, the collocation fixes an approximation of the objective functional based on finitely many scenarios a priori.

To the best knowledge of the authors, the SG method itself is the only method which can, in principle, be applied to structural optimization problems with infinitely many state problems without relying on an a priori approximation of the objective functional. However, as will be shown in this article, only the CSG method is able to solve these problems successfully.

Despite this, there are substantial structural similarities between our CSG method and the classic SG method and its relatives. Therefore, in the following, we briefly describe the basic SG idea.

The original SG method, see Robbins and Monro (1951), is a method frequently used to minimize functions of the form \(F:{\mathbb {R}}^{p} \to {\mathbb {R}},\) with

$$ F(x) := \frac{1}{n}\sum\limits_{i = 1}^{n} f_{i}(x) $$

and \(f_{i} : {\mathbb {R}}^{p} \mapsto {\mathbb {R}}\) for all i ∈{1,…,n}. Whereas conventional gradient methods calculate all n gradients ∇fi in each iteration, the stochastic gradient method uses only a small random selection thereof. Hereby, for large n (as, e.g., in machine learning applications, see Bottou and Cun 2004), the computational effort can be drastically reduced in comparison to the classical gradient method. Schmidt et al. (2017) introduced an improved version of the SG algorithm, the so-called stochastic average gradient (SAG) method; see also Bottou et al. (2018). This method benefits from previously calculated gradients, leading to a better approximation of the exact gradient. Thus, a better convergence behavior is typically observed. In this paper, the following basic properties of the SG and the SAG method will be combined:

  • Low computational effort per iteration

  • Reuse of previously obtained information

We will integrate these two properties in our CSG algorithm and compare it with the SG and the SAG method by means of examples.
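As a minimal illustration of the two update rules on a toy finite-sum objective, consider the following sketch; all names and constants are illustrative and not taken from the cited references:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum objective: F(x) = (1/n) * sum_i (x - c_i)^2,
# whose unique minimizer is the mean of the c_i.
c = rng.normal(loc=1.0, size=50)
grad_i = lambda x, i: 2.0 * (x - c[i])          # gradient of the i-th summand

def sg(x0, n_iter=2000):
    """Plain SG: one randomly selected summand gradient per iteration,
    with a decreasing step length."""
    x = x0
    for n in range(1, n_iter + 1):
        i = rng.integers(len(c))
        x -= (0.5 / n) * grad_i(x, i)
    return x

def sag(x0, n_iter=2000, tau=0.05):
    """SAG: refresh the stored gradient of one summand and step along the
    average of all stored (possibly outdated) gradients, which allows a
    constant step length."""
    x = x0
    g = np.zeros(len(c))                        # stored summand gradients
    for _ in range(n_iter):
        i = rng.integers(len(c))
        g[i] = grad_i(x, i)
        x -= tau * g.mean()
    return x

print(sg(0.0), sag(0.0), c.mean())              # both approach c.mean()
```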

The remainder of this paper is structured as follows: We will first close this section with the formal statement of the problem. In Section 2, we introduce the proposed CSG algorithm by which we aim to solve the types of problems discussed. In Section 3, we then analytically study the convergence of the algorithm and state assumptions necessary for convergence. The main theoretical results are given in the two Theorems 19 and 20. Section 4 provides first numerical results of the CSG algorithm, as well as a comparison with the SG and SAG methods. Finally, in Section 5, we provide a summary and a brief outlook for further scientific work.

1.1 Formal statement of the problem

Generally, we consider objective functionals as defined in (1.1) and further assume the following:

Assumption 1 (Objective functional)

The reduced objective functional \(F : {U_{\text {ad}}} \mapsto {\mathbb {R}}\) is given by the composition of a mapping \( J : L^{1}(V_{\text {ad}} ; {\mathbb {R}}) \mapsto {\mathbb {R}} \) and \(f : {U_{\text {ad}}} \times V_{\text {ad}} \mapsto {\mathbb {R}}\). The Fréchet derivative of J will be denoted by its associated function \(D_{ J} : L^{\infty }(V_{\text {ad}} ; {\mathbb {R}}) \times V_{\text {ad}} \mapsto {\mathbb {R}}\). Both DJ and ∇uf have to be Lipschitz continuous w.r.t. both their arguments in the respective topology.

For a definition of the Fréchet differential, see for instance (Jahn 2007, Def. 3.11.). With F chosen as in (1.1) and the latter assumption, we can state its derivative as follows:

Remark 2 (Derivative of F)

The derivative of F is then given by the following:

$$ \nabla F(u) = {\int}_{V_{\text{ad}}} D_{ J}(f(u,\cdot),v) \cdot \nabla_{u} f(u,v) \mathrm{d} v $$
(1.4)

for all u ∈ Uad, with DJ being the Fréchet differential as defined in Assumption 1.

The Fréchet derivative mentioned is exemplified in two relevant cases:

Remark 3 (Examples for Fréchet derivatives)

If J(f(u,⋅)) is the expected value of f w.r.t. the second argument, we obtain the following:

$$ D_{ J}(f(u,\cdot),v) = 1 $$

and for J(f(u,⋅)) being the variance of f with respect to the second argument, we obtain for all v ∈ Vad the following:

$$ \begin{array}{@{}rcl@{}} D_{J}(f(u,\cdot),v) &=&2(f(u,v)-\mathbb{E}[f(u,\cdot)])\\ &&-{\int}_{V_{\text{ad}}}2(f(u,w)-\mathbb{E}[f(u,\cdot)]) \mathrm{d} w. \end{array} $$
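For the variance case, the stated formula can be recovered by a short computation; as a sketch, write \(g := f(u,\cdot)\), \(\bar g := \mathbb{E}[g] = {\int}_{V_{\text{ad}}} g(w) \mathrm{d}w\), \(\bar h := {\int}_{V_{\text{ad}}} h(w) \mathrm{d}w\), and perturb in a direction h:

$$ \begin{array}{@{}rcl@{}} \left.\frac{\mathrm{d}}{\mathrm{d}\varepsilon}\right|_{\varepsilon=0} \text{Var}(g+\varepsilon h) &=& 2{\int}_{V_{\text{ad}}} \left( g(v)-\bar g\right)\left( h(v)-\bar h\right) \mathrm{d}v\\ &=& {\int}_{V_{\text{ad}}} \left[ 2\left( g(v)-\bar g\right) - {\int}_{V_{\text{ad}}} 2\left( g(w)-\bar g\right) \mathrm{d} w \right] h(v) \mathrm{d}v. \end{array} $$

Reading off the integral kernel acting on h in the last line yields the expression for DJ stated above.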

As already mentioned, in the case of structural optimization, for each parameter v ∈ Vad, the evaluation of f(⋅,v) requires the solution of an associated state problem. Consequently, the (approximate) evaluation of F and its gradient given in (1.4) is computationally very expensive. The algorithm introduced in Section 2 is thus especially attractive for problems in which F is numerically expensive to evaluate due to this functional dependency.

One advantage of our algorithm in comparison to the SG and the SAG method is that no (a priori) quadrature rule is used to approximate the integral in (1.4). In this way, we later obtain convergence of a subsequence to a stationary point of the continuous problem (1.1). Moreover, this helps to avoid artificial local minima, as will be outlined in the next section.

2 Continuous stochastic gradient method

Before presenting the new optimization algorithm, we briefly discuss how a discretization of the objective functional can introduce local minima, considering the following function:

$$ F(u) := {\int}_{-1}^{1} \frac{v^{2}}{(u-v)^{2}+10^{-3}} \mathrm{d}v. $$
(2.1)

By choosing \({U_{\text {ad}}} = [-\frac {1}{2},\frac {1}{2}]\) and Vad = (− 1,1), this corresponds to problem (1.1).

In Fig. 1, the graph of the function F in (2.1) is shown along with numerical approximations based on equidistant discretizations of the integral. Although the original function is convex, local minima are introduced due to the discretization of the integral. It is noted that without much information on the function f, it is hard to choose a suitable discretization, which would avoid this effect. Nevertheless, a good algorithm should be able to prevent convergence to such an artificial local minimum. In fact, this is one of the key features of the CSG algorithm introduced in detail in the following section.

Fig. 1 The analytic function F (blue) and the function F discretized with 4 (red), 8 (yellow), 16 (green), and 32 (purple) equidistant discretization points. F is given in (2.1)
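The effect shown in Fig. 1 is easy to reproduce numerically; a minimal sketch (assuming one simple equidistant rule, since the precise quadrature behind Fig. 1 is not specified):

```python
import numpy as np

f = lambda u, v: v**2 / ((u - v)**2 + 1e-3)

def F_discrete(u, N):
    """One equidistant N-point discretization of the integral in (2.1)."""
    v = np.linspace(-1.0, 1.0, N)
    return 2.0 * np.mean(f(u, v))       # interval length |(-1, 1)| = 2

u = np.linspace(-0.5, 0.5, 2001)
for N in (4, 8, 16, 32):
    FN = np.array([F_discrete(ui, N) for ui in u])
    # count the interior local minima introduced by the discretization
    n_min = int(np.sum((FN[1:-1] < FN[:-2]) & (FN[1:-1] < FN[2:])))
    print(f"N = {N:2d}: {n_min} local minima on U_ad = [-1/2, 1/2]")
```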

2.1 The CSG algorithm

With \(\mathcal {P}_{U_{\text {ad}}}\) being the orthogonal projection onto Uad, \(\lambda ^{d_{v}}\) being the Lebesgue measure in the dv-dimensional space, and

$$ {M}_{i,j}^{n} := \left\{v\in V_{\text{ad}} : \left\|\binom{u_{n}-u_{i}}{v-\omega_{i}}\right\|_{*} < \left\|\binom{u_{n} - u_{j}}{v-\omega_{j}}\right\|_{*} \right\} $$

being the set of points that are closer to (ui,ωi) than they are to (uj,ωj) in the ∥⋅∥∗-norm given in Definition 10, we can state the proposed Algorithm 1:

Algorithm 1 (CSG method; displayed as an image in the original document)
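Since Algorithm 1 appears only as a figure above, the following Python sketch is a hedged reconstruction from the surrounding description, specialized to J being the integral over Vad (so DJ ≡ 1) and to one-dimensional Uad and Vad; estimating the weights by sampling a fine grid of Vad is an illustrative choice, not necessarily the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def csg(grad_f, f, u0, U=(-0.5, 0.5), V=(-1.0, 1.0), n_iter=256,
        tau=lambda n: 0.1 * (n + 1) ** (-0.6), n_grid=2000):
    """Sketch of the CSG iteration for J = integral over V_ad (D_J = 1)
    and one-dimensional U_ad, V_ad.  All stored samples (u_i, omega_i, g_i)
    are reused: the weight alpha_i^n is the measure of the set of v in V_ad
    whose nearest stored pair, in the norm |u_n - u_i| + |v - omega_i|,
    has index i; here this measure is estimated on a fine grid of V_ad."""
    a, b = V
    u, F_hat = u0, None
    us, oms, gs, fs = [], [], [], []
    v_grid = np.linspace(a, b, n_grid)
    for n in range(n_iter):
        om = rng.uniform(a, b)                   # sample omega_n uniformly
        us.append(u); oms.append(om)
        gs.append(grad_f(u, om)); fs.append(f(u, om))
        # index of the nearest stored sample for every grid point of V_ad
        dist = (np.abs(u - np.array(us))[:, None]
                + np.abs(v_grid[None, :] - np.array(oms)[:, None]))
        k = dist.argmin(axis=0)
        alpha = (b - a) * np.bincount(k, minlength=len(us)) / n_grid
        G_hat = float(alpha @ np.array(gs))      # search direction
        F_hat = float(alpha @ np.array(fs))      # objective approximation
        u = min(max(u - tau(n) * G_hat, U[0]), U[1])   # projected step
    return u, F_hat

# Applied to (2.1): the iterates settle near the minimizer u* = 0.
fobj = lambda u, v: v**2 / ((u - v)**2 + 1e-3)
gobj = lambda u, v: -2.0 * v**2 * (u - v) / ((u - v)**2 + 1e-3)**2
print(csg(gobj, fobj, u0=0.4))
```

The stepsize rule mirrors the choice \(\tau_{n} = 10^{-1} n^{-0.6}\) used for (2.1) in Section 4.1.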

Note that \({(V_{i}^{n})}_{i = 0}^{n}\) defined in Algorithm 1 is a partition of Vad for all \(n\in \mathbb {N}\), i.e., \(V_{i}^{n} \cap V_{j}^{n} = \emptyset \) for all i,j ∈{0,…,n} with i ≠ j and \(V_{\text {ad}} = {\cup }_{i=0}^{n} \bar {V}_{i}^{n}\).

The CSG method as defined in Algorithm 1 is suitable for problems of the form (1.1) and is structured like most gradient descent methods. In each iteration n, a search direction \(\hat G_{n}\) (an approximation of the gradient ∇Fn := ∇F(un)) is calculated, a step length τn > 0 is chosen, and a sequence \((u_{n})_{n\in \mathbb {N}}\) is generated by the following:

$$ u_{n+1} := \mathcal{P}_{U_{\text{ad}}}\left( u_{n} - \tau_{n} \hat G_{n}\right). $$

Note that the existence and uniqueness of \(\mathcal {P}_{U_{\text {ad}}}\) are guaranteed by the projection theorem (see, e.g., Aubin (2000)), and that it is defined for all \(w \in \mathbb {R}^{d_{u}}\) by the following:

$$ \mathcal{P}_{U_{\text{ad}}}(w) := \underset{u\in{U_{\text{ad}}}}{\arg\min} \left\|u - w\right\|. $$
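For box constraints, as in the application in Section 4.2 where \({U_{\text{ad}}} = [\varepsilon,1]^{M}\), this projection reduces to componentwise clipping; a minimal sketch:

```python
import numpy as np

def project_box(w, lo, hi):
    """Euclidean projection onto the box [lo, hi]^M: since the squared
    distance separates over the components, clipping each entry is optimal."""
    return np.clip(w, lo, hi)
```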

In this contribution, we use the abbreviation Fn := F(un), and ∥⋅∥ denotes the Euclidean norm in the respective dimension. The distinctive feature of the algorithm lies in the calculation of the search direction \(\hat G_{n}\). In each iteration n, the gradient ∇uf(un,⋅) is evaluated at a random position ωn ∈ Vad and stored for later iterations. The search direction \(\hat G_{n}\) is in principle a linear combination of the former gradients \(g_{i}:=\nabla _{u} f(u_{i},\omega _{i})\), i = 0,…,n, with weights \({\alpha _{i}^{n}}\), i = 0,…,n. To provide an idea of how the weights are calculated, we refer to the sketch in Fig. 2. There, \(\omega _{0},\dots ,\omega _{10} \in V_{\text {ad}}\) are randomly sampled points and \(g_{0},\dots ,g_{10}\) the corresponding gradients. The approximate gradient is then given as \(\hat G_{10} = {\sum }_{k=0}^{10} \alpha ^{10}_{k}g_{k}\), where \(\alpha ^{10}_{0},\ldots ,\alpha ^{10}_{10}\) are the lengths of the line segments associated with the points (ω0,u0),…,(ω10,u10). Here, the assignment of segments to points is indicated by the same color. The underlying structure is the Voronoi diagram of the points \((\omega _{k},u_{k})_{k\in \{0,\dots ,10\}}\) (see, e.g., Fortune (1995)). More formally, the weights \(\alpha ^{10}_{k}\) can be defined for all k ∈{0,…,10} as the dv-dimensional measure of Ωk ∩ (Vad ×{u10}), where Ωk ⊂ Vad × Uad is the Voronoi face associated with the point (ωk,uk).

Fig. 2 Example for the calculation of the approximated gradient \(\hat G_{n}\) in Algorithm 1 in the case of \(V_{\text {ad}},{U_{\text {ad}}} \subset \mathbb {R}\)

We note that the computational complexity per iteration is given by the evaluation of the gradient of the function f and the calculation of the weights \({{\alpha }_{0}^{n}},\dots ,{{\alpha }_{n}^{n}}\). Up to the calculation of the weights, this is analogous to the SG and the SAG method. It should also be noted that all the gradients \(g_{0},\dots ,g_{n-1}\) of the previous iterations enter the current iteration. In Section 3, we will show that the error \(\|\hat G_{n}- \nabla F_{n} \|\) almost surely converges to zero. Hence, the algorithm does not become trapped in a local minimum of a discretized function. Therefore, the problem described in Fig. 1 will not arise, since the approximation of ∇Fn by \(\hat G_{n}\) becomes increasingly accurate.

3 Convergence analysis

In this section, we will study the convergence of the proposed algorithm. Due to the randomly chosen evaluation points within the algorithm, we have to study probabilistic convergence behavior in terms of “almost sure convergence” and convergence in expectation. These notions of convergence, as well as further assumptions and definitions, are given in Section 3.1. In Section 3.2, we prove that the error in the gradient approximation goes to zero and finally apply this result in Section 3.3 to prove convergence of the CSG method.

3.1 Assumptions, definitions, and preliminary results

For the convergence analysis of Algorithm 1, the following assumptions on the objective functional, the step length, and the sets Uad, Vad are important ingredients. In the following, we assume that these assumptions are always satisfied without mentioning this explicitly.

Definition 4 (Lipschitz constants and maxima)

We will denote the Lipschitz constants of DJ and ∇uf with respect to their first and second arguments by \(L^{(1)}_{D_{J}}, L^{(2)}_{D_{J}}\) and \(L^{(1)}_{\nabla_{u} f}, L^{(2)}_{\nabla_{u} f}\), respectively. Their maximal absolute function values will be denoted by \(C_{D_{J}}\) and \(C_{\nabla_{u} f}\).

For Uad and Vad, we assume the following:

Assumption 5 (Regularity of Uad and Vad)

The set \({U_{\text {ad}}} \subset \mathbb {R}^{d_{u}}\) is compact and convex. \(V_{\text {ad}} \subset \mathbb {R}^{d_{v}}\) is open and bounded. In addition, there exists \(c\in \mathbb {R}\) s.t. \(\left |V_{\text {ad}} \setminus {V}_{\text {ad}}^{r} \right | \leq c r \ \forall r\in (0,1)\), with \({V}_{\text {ad}}^{r} := \{x\in V_{\text {ad}} : B_{r}(x) \subset V_{\text{ad}}\}\), where \(B_{r}(x) \subset \mathbb {R}^{d_{v}} \) is the open ball centered at \(x\in \mathbb {R}^{d_{v}}\) with radius r.

The latter assumption is fulfilled for non-pathological open sets with finite perimeter.

Assumption 6 (Step length)

The step length \((\tau _{n})_{n \in \mathbb {N}}\) satisfies the following: \(\exists N \in \mathbb {N}\), \(c_{1},c_{2}\in \mathbb {R}_{>0}\), and \(\delta \in \left (0,\frac {1}{\max \limits \{d_{v},2\}}\right )\) s.t.

$$ c_{1} n^{-1}\leq \tau_{n} \leq c_{2} { n^{-1+\frac{1}{\max\{d_{v},2\}}-\delta}} \quad \forall n \in \mathbb{N}_{>N}. $$

These conditions on the step length satisfy the conditions stated in Robbins and Monro (1951, Eqns. (6) and (26)), as well as, equivalently, in Bottou et al. (2018, Eqn. (4.19)) in the one-dimensional case, and can be seen as a higher dimensional equivalent.

Remark 7 (Step length for dv = 1 and dv = 2)

In case of a one- or two-dimensional set Vad, Assumption 6 is satisfied iff

$$ \tau_{n} \in {\Omega}(n^{-1})\cap o(n^{-\frac{1}{2}}) $$

with the Big Omega and Little Oh notation as defined in Bürgisser and Cucker (2013). In other words, the null sequence \((\tau _{n})_{n\in \mathbb {N}}\) must not tend to zero faster than \((n^{-1})_{n\in \mathbb {N}}\), but also not as slowly as \((n^{-\frac {1}{2}})_{n\in \mathbb {N}}\).
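As an illustration, one admissible schedule is sketched below; the constants c and δ are free choices (for dv ≤ 2 and δ = 0.1, this reproduces the exponent −0.6 used in the experiments of Section 4):

```python
def tau(n, d_v=1, delta=0.1, c=1.0):
    """A step length satisfying Assumption 6 for n >= 1: proportional to
    n^(-1 + 1/max(d_v, 2) - delta), which stays above c * n^(-1) and below
    the admissible upper bound; requires 0 < delta < 1/max(d_v, 2)."""
    m = max(d_v, 2)
    assert 0.0 < delta < 1.0 / m
    return c * n ** (-1.0 + 1.0 / m - delta)
```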

The lower bound for the stepsizes ensures that the accumulated stepsizes reach infinity, so that the algorithm does not get stuck due to their reduction. The upper bound ensures that the approximation of the search direction is sufficiently accurate; it matches the rate of convergence for empirical measures, see, e.g., Dudley (1969, Prop. 3.4.).

Complementing these assumptions on the stepsize, a result for the case of a step length bounded away from zero will be stated in Theorem 21.

To show convergence of the algorithm, we must first state the probability space setting.

Definition 8 (Probability space setup)

The probability space \(({\varOmega },\mathcal {A},\mathbb {P})\) is given by the following:

$$ \begin{array}{@{}rcl@{}} {\varOmega} &:=& {V}_{\text{ad}}^{\mathbb{N}}, \quad \mathbb{P} := \mu^{\otimes \mathbb{N}}\\ \mathcal{A} &:=& \sigma \{A_{1}\times \ldots \times A_{n}: A_{i} \in \mathcal{B}(V_{\text{ad}}), \forall i, n \in \mathbb{N}\}, \end{array} $$

where \(\mu ^{\otimes \mathbb {N}} (A_{1}\times {\ldots } \times A_{n}) = {\prod }_{i = 1}^{n} \frac {\mu (A_{i})}{\mu (V_{\text {ad}})} \) is the product measure and \(\mu = \lambda ^{d_{v}}\) the Lebesgue measure in \(\mathbb {R}^{d_{v}}\).

All the following random variables are defined in this setting. For the convergence of random variables, we use the following commonly used notation:

Definition 9 (Stochastic convergence)

A sequence of random variables \((Z_{n})_{n\in \mathbb {N}}\) converges almost surely to some random variable Z iff

$$ \mathbb{P}\left( \{\omega \in {\varOmega} : \underset{n\to\infty}{\lim}Z_{n}(\omega) = Z(\omega) \}\right) = 1, $$

which we denote by \(Z_{n} \stackrel{\text{a.s.}}{\longrightarrow} Z\).

In this document, we define the following norm on the product space Uad × Vad.

Definition 10 (Norm in Uad × Vad)

For better readability, we define the following norm on the product space Uad × Vad, the sum of the Euclidean norms of the two components:

$$\left\| \binom{u}{v} \right\|_{*} := \|u\|_{2}+\|v\|_{2}.$$

Due to norm equivalence in finite dimensional spaces, the results presented later hold for all chosen norms in Uad and Vad and combinations thereof.

The orthogonal projection used in Algorithm 1 has some important properties:

Lemma 11 (Orthogonal projection)

Let \(S\subset \mathbb {R}^{n}\) for \(n\in \mathbb {N}_{>0}\) be closed and convex, and let \(\mathcal {P}_{S}\) be the orthogonal projection onto S. Then, the following holds for all \(x,y \in \mathbb {R}^{n}\) and z ∈ S:

  a) \((\mathcal {P}_{S}(x) - x)^{T} (\mathcal {P}_{S}(x) - z) \leq 0\),

  b) \( (\mathcal {P}_{S}(y) - \mathcal {P}_{S}(x))^{T}(y-x) \geq \|\mathcal {P}_{S}(y) - \mathcal {P}_{S}(x)\|^{2} \geq 0\),

  c) \(\|\mathcal {P}_{S}(y) - \mathcal {P}_{S}(x) \| \leq \|y-x\|\).

Proof

(a) is (ii) in Aubin (2000, Thm. 1.4.1), and (b) and (c) are (iii) and (ii) in Aubin (2000, Prop. 1.4.1). □

For \(h \in C^{1}({U_{\text{ad}}})\) and Uad convex, the following first-order optimality conditions are equivalent:

Corollary 12 (Optimality conditions)

For \(u^{*} \in {U_{\text{ad}}}\), the following items are equivalent:

  i) \(-\nabla h(u^{*})^{T}(u-u^{*}) \leq 0 \ \forall u \in {U_{\text{ad}}}\),

  ii) \(\mathcal {P}_{U_{\text{ad}}}(u^{*} -t\nabla h(u^{*})) = u^{*} \quad \forall t \geq 0\).

We call \(u^{*}\) satisfying one of the above conditions a stationary point.

Proof

Define for u ∈ Uad the cone

$$ N_{{U_{\text{ad}}}}(u) := \{x \in \mathbb{R}^{d_{u}}: x^{T}({\bar{u}}-u) \le 0\ \forall \bar u \in {U_{\text{ad}}}\}. $$

Using Lemma 11 ((a) for “⇒” and (b) for “⇐”), it is straightforward to see that for \(x \in \mathbb {R}^{d_{u}}\) and u ∈ Uad,

$$ {\mathcal{P}}_{{U_{\text{ad}}}} (x) = u \quad \Leftrightarrow \quad (x-u) \in {N}_{{U_{\text{ad}}}}(u). $$

Since (i) states precisely that \(-\nabla h(u^{*}) \in {N}_{{U_{\text {ad}}}}(u^{*})\), and since \(N_{{U_{\text {ad}}}}(u^{*})\) is a cone, this is equivalent to \(-t\nabla h(u^{*}) \in {N}_{{U_{\text {ad}}}}(u^{*})\) for all t ≥ 0, i.e., to (ii). □

3.2 Error in the approximate gradient

In this subsection, we analyze the error between the approximate gradient \(\hat G_{n}\) in the nth iteration and the gradient of the objective functional ∇Fn. To do this, we define for v ∈ Vad, ω ∈ Ω the sequence of random variables \(\left (X_{n}\right )_{n\in \mathbb {N}}\) by the following:

$$ X_{n}(\omega;v):= \underset{k = 1,\dots, n}{\min}\left\| \binom{u_{k}(\omega)-u_{n}(\omega)}{\omega_{k}-v} \right\|_{*} . $$

Lemma 13 (Convergence result)

For v ∈ Vad,

$$ \sum\limits_{n = 1}^\infty \mathbb{P}\left( X_n(\cdot ; v) > \varepsilon_{n} \right) < \infty, $$

where \(\varepsilon _{n} := 2C_{\nabla _{u} f}\, c_{2} |V_{\text{ad}}| n^{-\frac {\delta }{2}} + \tilde {\varepsilon }_{n}\) and \(\tilde {\varepsilon }_{n} := n^{\frac {\delta }{2}-\frac {1}{\max (2,d_{v})}}\), with c2 and δ defined in Assumption 6 and \(C_{\nabla _{u} f}\) in Definition 4. Moreover,

$$ \sum\limits_{n = 1}^{\infty} \sup\limits_{v \in V_{ad}^{\varepsilon_{n}} } \mathbb{P}\left( X_n(\cdot; v) > \varepsilon_{n} \right) < \infty, $$

with \(V_{ad}^{\varepsilon _{n}}\) defined in Assumption 5.

Proof

By item (c) in Lemma 11, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{P}(X_{n}(\cdot;v) \geq \varepsilon_{n}) \\ &&\leq \mathbb{P}\left( \underset{k = i_{0},\dots, n}{\min}\left\| \left( \begin{array}{c}u_{k}(\omega)-u_{n}(\omega)\\ \omega_{k}-v\end{array}\right)\right\|_{\ast} \geq \varepsilon_{n} \right)\\ &&\leq \mathbb{P}\left( \sum\limits_{i=i_{0}}^{n-1}\|\tau_{i} \hat{G}_{i} \| + \underset{k = i_{0},\dots, n}{\min} \|\omega_{k} - v\| \geq \varepsilon_{n} \right) \\ &&\leq \mathbb{P}\left( C_{\nabla_{u} f}\, c_{2} |V_{\text{ad}}|\sum\limits_{i=i_{0}}^{n-1} \frac{1}{i^{\kappa}} + \underset{k = i_{0},\dots, n}{\min} \|\omega_{k} - v\| \geq \varepsilon_{n} \right) \end{array} $$

where i0 := ⌈n − an + 1⌉ and κ ∈ (0,1) is given by \(\kappa := 1-\frac {1}{\max \limits \{d_{v},2\}}+\delta \). For dv = 1, we choose \(a_{n} = \sqrt {n}\), and if dv ≥ 2, we choose \(a_{n} = n^{1-\frac {\delta }{2}}\). Observe that for n > 2 we obtain

$$ \begin{array}{@{}rcl@{}} &&{}\sum\limits_{i=i_{0}}^{n-1} \frac{1}{i^{\kappa}} \le {\int}_{i_{0}-1}^{n} \frac{1}{s^{\kappa}} \mathrm{d} s = \frac{1}{1-\kappa}\cdot \left( n^{1-\kappa} - (\lceil n-a_n\rceil)^{1-\kappa} \right) \\ &&{}\le \frac{1}{1-\kappa} \cdot \left( n^{1-\kappa} - (n - a_{n})^{1-\kappa} \right) = \frac{1}{1 - \kappa} \cdot \left( \frac{n}{n^{\kappa}} - \frac{n - a_{n}}{(n - a_{n})^{\kappa}} \right)\\ &&{}= \frac{1}{1-\kappa}\cdot \left( \frac{n^{\kappa} (1-\frac{a_{n}}{n})^{\kappa} - n^{\kappa}}{n^{\kappa} \cdot (n-a_{n})^{\kappa}} \right) + \frac{a_n}{(1-\kappa)(n-a_{n})^{\kappa}}\\ &&{}\text{applying Bernoulli's inequality in the first term}\\ &&{}\le \frac{1}{1-\kappa} \cdot \left( \frac{-\kappa a_{n}}{(n - a_{n})^{\kappa}} \right) + \frac{a_{n}}{(1 - \kappa)(n - a_{n})^{\kappa}} = \frac{a_{n}}{n^{\kappa}}(1 - \tfrac{a_{n}}{n})^{-\kappa} \end{array} $$

As \(\frac {a_{n}}{n} = n^{\frac {\delta }{2}-\frac {1}{\max \limits \{d_{v},2\}}} \leq n^{-\frac {1}{2\max \limits \{d_{v},2\}}}\), we obtain

$$ \sum\limits_{i=i_{0}}^{n-1}\|\tau_{i} \hat{G}_{i} \| \leq \frac{C_{\nabla_{u} f}\, c_{2} |V_{\text{ad}}|}{1- 2^{-\frac{1}{2\max\{d_{v},2\}}}} n^{-\frac{\delta}{2}} = \varepsilon_{n}. $$
(3.1)

For all v ∈ Vad, there exists n large enough such that \(B_{\tilde {\varepsilon }_{n}}(v) \subset V_{\text{ad}}\). Hence,

$$ \begin{array}{@{}rcl@{}} &&\mathbb{P}(X_{n}(\cdot;v) \geq \varepsilon_{n}) \leq \mathbb{P}\left( \underset{k = i_{0} ,\dots, n-1}{\min} \|\omega_{k} - v\| \geq \tilde \varepsilon_{n} \right) \\ &&= \mathbb{P}\left( \|\omega_{k} - v\| \geq \tilde{\varepsilon}_{n} \quad \forall k\in \left\{i_{0} ,\dots, n-1\right\}\right)\\ &&=\prod\limits_{k=i_0}^{n-1} \mathbb{P}\left( \|\omega_{k} - v\| \geq \tilde \varepsilon_{n} \right) =\prod\limits_{k=i_{0}}^{n-1} \left( 1 - \frac{|B_{\tilde{\varepsilon}_{n}}(v)|}{|V_{ad}|} \right) \\ &&\le\left( 1 - \frac{|B_{\tilde{\varepsilon}_{n}}(v)|}{|V_{ad}|} \right)^{ a_{n}} = \left( 1 - \nu \cdot n^{-\frac{d_{v}}{\max(2,d_{v})}+ \frac{d_{v} \delta}{2}}\right)^{ a_{n}}, \end{array} $$

with \(\nu := \frac {\pi ^{\frac {d_{v}}{2}}}{\Gamma (\frac {d_{v}}{2}+1)|V_{ad}|}\). Thus

$$ \mathbb{P}(X_{n}(\cdot;v) \geq \varepsilon_{n}) \leq \left( 1 - \nu\cdot \frac{n^{\frac{\delta}{2}}}{a_{n}}\right)^{ a_{n}}. $$

By the limit comparison test, the corresponding series converges. Finally, note that Assumption 5 gives that \(v \in V_{\text{ad}}^{\varepsilon _{n}}\) implies \(B_{\varepsilon _{n}}(v) \subset V_{\text{ad}}\) and therefore \(\displaystyle \sup \limits _{v \in V_{\text{ad}}^{\varepsilon _{n}}} \mathbb {P}(X_{n}(\cdot ;v) \geq \varepsilon _{n}) \le \left (1 - \nu \cdot \frac {n^{\frac {\delta }{2}}}{a_{n}}\right )^{ a_{n}}\). □

As a direct consequence of the latter result, we obtain almost sure convergence:

Corollary 14 (Density of ω in Vad)

For all v ∈ Vad,

$$ X_{n}(\cdot ; v) \stackrel{\text{a.s.}}{\longrightarrow} 0 \quad\text{for}\quad n\to\infty. $$

Proof

The result follows by Lemma 13 and the Borel–Cantelli lemma, see, e.g., Klenke (2013, Thm. 6.12). □

Thus, due to the Lipschitz continuity of ∇uf and DJ, the integral in ∇F(un) is increasingly better approximated by \(\hat G_{n}\) for \(n \rightarrow \infty \):

Corollary 15 (Error in gradient approximation)

The norm of the difference between the approximate gradient \(\hat G_{n}\) in the nth iteration (defined in Algorithm 1) and the gradient of the exact objective functional ∇F at un goes to zero, i.e.,

$$ \| \hat{G}_{n} - \nabla F_{n}\| \stackrel{\text{a.s.}}{\longrightarrow} 0, \quad \underset{n\to\infty}{\lim} \mathbb{E}\left[\|\hat{G}_{n} - \nabla F_{n}\| \right] = 0 $$

and

$$ \sum\limits_{n = 0}^{\infty} \tau_{n} \mathbb{E}\left[ \|\hat{G}_{n} - \nabla F_{n} \|\right] < \infty. $$
(3.2)

Proof

For v ∈ Vad, define

$$ k^{n}(v) := \underset{{k = 1,\ldots,n}}{\arg \min} \left\{\left\|\binom{u_{k}(\omega)-u_{n}(\omega)}{\omega_{k}-v}\right\|_{*} \right\}. $$

Then,

$$ \hat G_{n} = {\int}_{V_{\text{ad}}} D_{ J}(\hat f(u_{n},\cdot),\omega_{k^{n}(v)})\nabla_{u} \hat f(u_{n},v) \mathrm{d} v, $$

where \(\hat f(u_{n},v) := f(u_{k^{n}(v)},\omega _{k^{n}(v)})\). By the Lipschitz continuity assumed in Assumption 1, we therefore obtain the following:

$$ \begin{array}{@{}rcl@{}} \| \hat{G}_{n} - \nabla F_{n}\| &\leq& \left( C_{\nabla_{u} f}\, L^{(1)}_{D_{J}}\max\{L^{(1)}_{f},L^{(2)}_{f}\} + C_{\nabla_{u} f}\, L^{(2)}_{D_{J}}\right.\\ && \left. + C_{D_{J}} \max\{L^{(1)}_{\nabla_{u} f},L^{(2)}_{\nabla_{u} f}\}\right) {\int}_{V_{\text{ad}}} X_{n}(\omega;v) \mathrm{d} v, \end{array} $$

with the constants from Definition 4 (and \(L^{(1)}_{f}, L^{(2)}_{f}\) the Lipschitz constants of f) and Xn(⋅;v) as defined in Lemma 13. Recall that Uad and Vad are bounded. Now, the almost sure convergence, as well as the convergence of the expectations, follows from Lebesgue's dominated convergence theorem.

Finally, let C be a generic constant and ε > 0. Since \(\sup _{v \in V_{\text {ad}}} X_{n}(\cdot ;v) \le D < \infty \), where D := diam(Vad) + diam(Uad) denotes the diameter of Vad plus the diameter of Uad, and by Fubini’s theorem, we have the following:

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[\|\hat{G}_{n} - \nabla F_{n} \|\right]\leq C \mathbb{E}\left[{\int}_{V_{\text{ad}}} X_{n}(\cdot;v) \mathrm{d}v \right] \\ &\leq& C {\int}_{V_{\text{ad}}} \varepsilon \mathbb{P}(X_{n}(\cdot;v) \leq \varepsilon) + 2 D \mathbb{P}(X_{n}(\cdot;v) > \varepsilon) \mathrm{d}v \\ &\leq& C \left( \varepsilon + {\int}_{V_{\text{ad}}\setminus V_{\text{ad}}^{\varepsilon}} \mathbb{P}(X_{n}(\cdot;v) > \varepsilon) \mathrm{d}v + {\int}_{V_{\text{ad}}^{\varepsilon}} \mathbb{P}(X_{n}(\cdot;v) > \varepsilon) \mathrm{d}v\right)\\ &\le& C \left( 2\varepsilon + \underset{v \in V_{\text{ad}}^{\varepsilon}}{\sup}\mathbb{P}(X_{n}(\cdot;v) > \varepsilon)\right), \end{array} $$

where \({V}_{\text {ad}}^{r}\) is given in Assumption 5. If we choose \(\varepsilon = \varepsilon _{n} = 2 C_{\nabla _{u} f}\, c_{2} |V_{\text{ad}}|\, n^{-\frac {\delta}{2}} + n^{\frac {\delta }{2}-\frac {1}{\max \limits (2,d_{v})}}\) as in Lemma 13, we obtain the following:

$$ \begin{array}{@{}rcl@{}} &&\sum\limits_{n = 1}^{\infty} \tau_{n} \mathbb{E}\left[\|\hat{G}_{n} - \nabla F_{n} \|\right] \\ &&\le C \left( \sum\limits_{n = 1}^{\infty} \frac{1}{n^{1+\delta}} + \frac{1}{n^{1 + \frac{\delta}{2}}} + \underset{v \in {V}_{\text{ad}}^{\varepsilon_{n}}}{\sup} \mathbb{P}(X_{n}(\cdot;v) > \varepsilon_{n}) \right) < \infty, \end{array} $$

which concludes the proof. □

3.3 Convergence result

As we have seen in Corollary 15, the error \(\|\hat {G}_{n} - \nabla F_{n}\| \) converges almost surely and in expectation to zero for \(n\rightarrow \infty \). It remains to provide sufficient conditions under which the algorithm converges to a stationary point.

Lemma 16 (Objective functional values)

The difference of the objective functional values in iteration \(n\in \mathbb {N}\) can be estimated as follows:

$$ F_{n+1} - F_{n} \leq -\frac{1}{\tau_{n}} \|u_{n+1} - u_{n}\|^{2} + \phi_{n}, $$

with \(\phi _{n} := \tau _{n} \|\nabla F_{n}-\hat {G}_{n} \| \cdot \|\hat {G}_{n}\| + {\tau _{n}^{2}} C\|\hat {G}_{n}\|^{2}\) and a constant \(C \in \mathbb {R}_{\geq 0}\) depending on the Lipschitz constants and suprema of the involved functions.

Proof

By the mean value theorem, there is a c ∈ (0,1) such that (we set \(\nabla {{F}_{n}^{c}} := \nabla F((1-c) u_{n} + c u_{n+1})\))

$$ \begin{array}{@{}rcl@{}} F_{n+1} - F_{n} &=& (\nabla {{F}_{n}^{c}})^{T}\left( u_{n+1}- u_{n}\right) \\ &=& \nabla {{F}_{n}^{T}}\left( u_{n+1}- u_{n}\right) + (\nabla {{F}_{n}^{c}} - \nabla F_{n})^{T}\left( u_{n+1}- u_{n}\right) \\ &\leq& \nabla {{F}_{n}^{T}}\left( u_{n+1}- u_{n}\right) + C \|u_{n+1}-u_{n}\|^{2}, \end{array} $$

using the Cauchy–Schwarz inequality and Definition 4. Recall that \(u_{n+1} = \mathcal{P}_{U_{\text {ad}}}(u_{n} - \tau _{n} \hat {G_{n}})\). With Lemma 11 (b) and (c) and the Cauchy–Schwarz inequality for the first term on the right-hand side of the latter equation, we obtain the following:

$$ \begin{array}{@{}rcl@{}} &&\nabla {{F}_{n}^{T}} \left( \mathcal{P}_{U_{\text{ad}}}(u_{n} - \tau_{n} \hat{G_{n}}) - u_{n}\right) \\ &=& \hat {G_{n}^{T}}\left( \mathcal{P}_{U_{\text{ad}}}(u_{n} - \tau_{n} \hat{G_{n}})- u_{n}\right) \\ &&+ (\nabla F_{n} - \hat G_{n})^{T}\left( \mathcal{P}_{U_{ad}}(u_{n} - \tau_{n} \hat{G_{n}})- u_{n}\right)\\ &\leq& -\frac{1}{\tau_{n}}(u_{n} - \tau_{n}\hat G_{n} -u_{n})^{T} \cdot \left( \mathcal{P}_{U_{ad}}(u_{n} - \tau_{n} \hat{G_{n}})- u_{n}\right)\\ &&+ \|\nabla F_{n} - \hat G_{n}\|\|\mathcal{P}_{U_{ad}}(u_{n} - \tau_{n} \hat{G_{n}}) - u_{n}\|\\ &\leq& -\frac{1}{\tau_{n}}\|u_{n+1} - u_{n}\|^{2} + \tau_{n} \|\nabla F_{n} - \hat G_{n}\| \|\hat G_{n}\|. \end{array} $$

Applying Lemma 11 (c) to the second term yields \( \|\mathcal {P}_{U_{\text {ad}}}(u_{n} - \tau _{n} \hat {G_{n}}) - u_{n}\|^{2}\leq {\tau _{n}^{2}} \|\hat G_{n}\|^{2}\), which concludes the proof. □

Since the first term on the right-hand side of the estimate in the above lemma is nonpositive while the second term is nonnegative, we may expect a descent, provided \(\mathbb {E}\left [\phi _{n}\right ]\) is small enough.

Corollary 17 (Convergence result)

We have the following:

$$\sum\limits_{n = 0}^{\infty} \mathbb{E}\left[ \phi_{n}\right] < \infty,$$

where \(\phi _{n}\) is defined as in Lemma 16.

Proof

Since \(\|\hat G_{n}\|\) is bounded and \(\sum \limits _{n = 1}^{\infty } {\tau _{n}^{2}} < \infty \) by Assumption 6, the result follows by (3.2). □

Before we present our main results, we need the following auxiliary result:

Lemma 18 (Projection of gradient steps)

$$ \|\mathcal{P}_{{U_{\text{ad}}}}(u_{n} - t \hat G_{n}) -u_{n}\| \leq \frac{t}{\tau_{n}}\|u_{n+1}-u_{n}\|\quad \forall t \geq 0. $$

Proof

Define \(x:=u_{n},y:=\hat G_{n}\), and τ := τn. We assume that \(x-\tau y \notin {U_{\text{ad}}}\) (otherwise the result follows directly by Lemma 11 (c)) and set \(n_{\tau }:= x-\tau y - \mathcal {P}_{U_{\text {ad}}}(x-\tau y)\) and

$$ H:=\{u\in\mathbb{R}^{d_{u}} \ | \ u^{T} n_{\tau} \leq \mathcal{P}_{U_{\text{ad}}}(x-\tau y)^{T} n_{\tau} \}. $$

Since Uad is convex, we have UadH, and therefore \(\forall u\in \mathbb {R}^{d_{u}}\) by Lemma 11,

$$ \|u-x\| \geq \|\mathcal{P}_{H}(u) - x\|\geq \|\mathcal{P}_{U_{\text{ad}}}(u) - x\|, $$
(3.3)

where \(\mathcal {P}_{H}\) is the orthogonal projection onto H (compare Fig. 3). With \(B:= \frac {t}{\tau }\left (\mathcal {P}_{U_{\text {ad}}}(x-\tau y)-x\right ) + x\) and (3.3), we have the following:

$$ \begin{array}{@{}rcl@{}} &&\frac{t}{\tau}\|\mathcal{P}_{U_{\text{ad}}}(x-\tau y) - x\| \geq \|\mathcal{P}_{H}(B)-x\|\\ &=& \|\mathcal{P}_{H}(x-ty)-x\| \geq \|\mathcal{P}_{U_{\text{ad}}}(x-ty)-x\|. \end{array} $$

□

Fig. 3 Illustration of the intercept theorem for the proof of Lemma 18

Recalling the characterization of stationary points from Corollary 12, we obtain our first main result:

Theorem 19 (Convergence result)

For all t ≥ 0,

$$ \sum\limits_{n=0}^{\infty} \tau_{n} \mathbb{E}\left[\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n}) -u_{n}\|^{2}\right] < \infty. $$

Proof

First, note that by the compactness of Uad and the regularity of F, \(F_{\inf } := \inf _{u \in {U_{\text {ad}}}} F(u) > -\infty \). Taking expectations and summing both sides of the inequality in Lemma 16 up to an \(N\in \mathbb {N}\) gives the following:

$$ \begin{array}{@{}rcl@{}} F_{\inf } - F_{0} &\leq& \mathbb{E}\left[F_{N+1}\right] - F_{0} = \sum\limits_{n = 0}^{N} \mathbb{E}\left[F_{n+1} - F_{n}\right]\\ &\leq& \sum\limits_{n = 0}^{N}\left( -\frac{1}{\tau_{n}} \mathbb{E}\left[ \|u_{n+1} - u_{n}\|^{2} \right] + \mathbb{E}\left[\phi_{n}\right] \right). \end{array} $$

Hence, by Corollary 17,

$$ \sum\limits_{n = 0}^{\infty} \frac{1}{\tau_{n}} \mathbb{E}\left[ \|u_{n+1} - u_{n}\|^{2} \right] \leq F_{0} - F_{\inf } + {\sum}_{n = 0}^{\infty} \mathbb{E}\left[\phi_{n}\right]< \infty. $$

Using Lemma 11 (c) together with Lemma 18 gives for all \(n \in \mathbb {N}\) sufficiently large,

$$ \begin{array}{@{}rcl@{}} && \|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n}) -u_{n}\|^{2} \\ &\leq& \left( \|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \hat G_{n}) -u_{n}\|\right. \\ &&+ \left.\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n}) - \mathcal{P}_{U_{\text{ad}}}(u_{n} - t \hat G_{n})\|\right)^{2} \\ &\leq& \left( \frac{t}{\tau_{n}}\|u_{n+1} -u_{n}\| + t\|\hat G_{n} - \nabla F_{n}\|\right)^{2}. \end{array} $$

Since \(\|u_{n+1} - u_{n}\| \le \tau _{n} \| \hat G_{n}\|\), this combined with Corollary 15 gives the result. □

As a direct consequence, we have the following:

Theorem 20 (Main theorem)

Let \((u_{n})_{n\in \mathbb {N}}\) be generated by Algorithm 1. Then, there exists a subsequence \((u_{n_{k}})_{k\in \mathbb {N}}\) converging to a stationary point, i.e.,

$$ \underset{n\to\infty}{\liminf}\ {\mathbb{E}}\left[\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n}) -u_{n}\|^{2}\right] = 0 \quad \forall t >0. $$

Proof

Direct consequence of Theorem 19. □

For applications, the condition on the step length (Assumption 6) is inconvenient, since the step length becomes very small and the algorithm thus progresses only slowly. If the algorithm is performed with a constant stepsize, and if the sequence \((u_{n})_{n\in \mathbb {N}}\) converges to some \(u^{*} \in {U_{\text{ad}}}\), then the following theorem demonstrates that \(u^{*}\) is a stationary point.

Theorem 21 (Convergent series)

Assume the stepsize sequence \((\tau _{n})_{n\in \mathbb {N}}\) satisfies \(\tau _{n} \geq \tau \ \forall n \in \mathbb {N} \) for some τ > 0. Let further \((v_{n})_{n\in \mathbb {N}}\) be dense in Vad and assume \((u_{n})_{n\in \mathbb {N}}\) converges to \(u^{*} \in {U_{\text{ad}}}\). Then, \(u^{*}\) is a stationary point of F, i.e.,

$$ \mathbb{E}\left[\|\mathcal{P}_{U_{\text{ad}}}(u^{*} - t \nabla F(u^{*})) -u^{*}\|^{2}\right] = 0 \quad \forall t > 0. $$

Proof

Similarly to the proof of Corollary 15, one obtains \(\|\hat G_{n} - \nabla F_{n}\| \rightarrow 0\). Thus, by the convergence of \((u_{n})_{n\in \mathbb {N}}\), we obtain the following:

$$ \begin{array}{@{}rcl@{}} &&\|\mathcal{P}_{U_{\text{ad}}}(u^{*} - t \nabla F(u^{*})) - u^{*}\| \\ &=& \underset{n\to \infty}{\lim} \|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n}) - u_{n}\| \\ &\leq& \underset{n\to \infty}{\lim} \left( \|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \hat G_{n}) - u_{n}\|\right.\\ &&+ \left.\|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \hat G_{n}) - \mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n})\|\right)\\ &\leq& \underset{n\to \infty}{\lim} \frac{t}{\tau_{n}} \|\mathcal{P}_{U_{\text{ad}}}(u_{n} - \tau_{n} \hat G_{n}) - u_{n}\| \\ &&+ \underset{n\to \infty}{\lim} \|\mathcal{P}_{U_{\text{ad}}}(u_{n} - t \hat G_{n}) - \mathcal{P}_{U_{\text{ad}}}(u_{n} - t \nabla F_{n})\|\\ &\leq& \frac{t}{\tau} \underset{n\to \infty}{\lim} \|u_{n+1} -u_{n}\| + t\underset{n\to \infty}{\lim} \| \hat G_{n} - \nabla F_{n}\| = 0 \end{array} $$

by Lemma 18 and Lemma 11 (c). □

Note that almost all sequences \((v_{n})_{n\in \mathbb {N}}\) that are given by the random number generator in Algorithm 1 are dense in Vad. In addition to the convergence properties shown in the latter theorems, the algorithm also approximates the objective functional value with arbitrary accuracy:

Corollary 22 (Approximation of F)

Let the sequence \((u_{n})_{n\in \mathbb {N}}\) be generated by Algorithm 1. Then, for all convergent subsequences \((u_{n_{k}})_{k\in \mathbb {N}}\) with \(u_{n_{k}} \rightarrow u^{*}\), we obtain for \(\hat F\) as defined in Algorithm 1: \( |\hat F_{n_{k}} - F(u^{*})| \stackrel {\text {a.s.}}{\rightarrow } 0 \). Assuming further that \((v_{n_{k}})_{k\in \mathbb {N}}\) is dense in Vad, we obtain \( \lim _{k\rightarrow \infty } \hat F_{n_{k}} = F(u^{*}) \).

Proof

The proof is similar to the proof of Lemma 13, relying on the Lipschitz continuity of F (a direct consequence of Assumption 1) and on Corollary 14. □

Remark 23 (Termination condition)

By the regularity assumption on the objective functional in Assumption 1, the termination condition in Algorithm 1 can be posed as follows:

$$ \|\mathcal{P}_{{U_{\text{ad}}}}\left( u_{n} - \hat G_{n}\right) - u_{n} \| \leq \epsilon$$

for 𝜖 > 0. This is obviously not possible for SG. To satisfy such a condition with the SAG method, the discretization of the objective functional has to be sufficiently fine in order to approximate the gradient with sufficient accuracy. However, depending on the particular example, an a priori discretization satisfying this property can be hard to determine.
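A minimal sketch of this termination test (the helper names are hypothetical; project stands for \(\mathcal{P}_{U_{\text{ad}}}\)):

```python
import numpy as np

def csg_converged(u_n, G_hat, project, eps):
    """Termination test of Remark 23: the norm of the projected-gradient
    residual, computable for CSG because G_hat approximates the full
    gradient with vanishing error."""
    return np.linalg.norm(project(u_n - G_hat) - u_n) <= eps
```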

4 Numerical results

In this section, we will compare the following stochastic optimization methods mentioned in the introductory section:

  • CSG (continuous stochastic gradient method): as introduced in Section 2, this scheme relies on the computation of a single gradient in each iteration and the interpolation with previously computed information.

  • SG (stochastic gradient method): the classical stochastic optimization scheme as outlined in Bottou et al. (2018). The convergence of the method is based on decreasing stepsizes.

  • SAG (stochastic average gradient method): an improved stochastic gradient scheme as introduced in Schmidt et al. (2017), restricted to the case of a finite sum as objective rather than an integral. The true advantage of possibly larger stepsizes can be seen in the examples and also holds for CSG. We will write SAGn (\(n\in \mathbb {N}\)) for an SAG method relying on an n-point quadrature rule to discretize the integral in the original objective.

The performance of these methods strongly depends on the chosen stepsizes. In the following examples, the stepsizes are chosen such that all the schemes perform well within the range of their possibilities. However, adaptive stepsize control (the stepsize is also known as the learning rate in the field of machine learning) is itself a subject of research for stochastic optimization schemes and is not the focus of this contribution. For example, in Kingma and Ba (2015), the stepsizes are derived from estimates of first and second moments of the gradients, and in Tan et al. (2016), a Barzilai-Borwein-type stepsize adaptation is discussed.

To compare the methods, we have chosen the following way to display the results. For a large number of optimization runs, we compute the quantile curves αp(n), defined by \(\mathbb {P}(u_{n} > \alpha _{p}(n)) = p\) for p ∈ (0,1). With these, we define the quantile sets Pp,q, which lie between the p- and the q-quantile curves, i.e.,

$$ P_{p,q} := \left\{(n,v) \in \mathbb{N} \times {U_{\text{ad}}} : \alpha_{p}(n) < v < \alpha_{q}(n)\right\}. $$
(4.1)

These areas are colored with various degrees of opacity in order to show the probabilistic behavior of the optimization procedure.
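A minimal sketch of how the quantile curves can be computed from a batch of recorded runs (the array layout is assumed here for illustration):

```python
import numpy as np

def quantile_curves(runs, ps=(0.1, 0.2, 0.3, 0.4)):
    """Given runs[r, n] = recorded quantity of run r at iteration n, return
    for each p the lower and upper curves bounding the quantile set P_{p,1-p}."""
    runs = np.asarray(runs)
    return {p: (np.quantile(runs, p, axis=0),
                np.quantile(runs, 1.0 - p, axis=0)) for p in ps}
```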

First, we will compare the algorithms by optimizing the function defined in (2.1) and restated in (4.2), as well as an additional academic objective functional defined in (4.3).

4.1 Academic examples

We will study the behavior of the algorithms in the following two cases:

$$ F(u) := {\int}_{ -1}^{1} \frac{v^{2}}{(u-v)^{2}+10^{-3}} \mathrm{d} v $$
(4.2)

as introduced in (2.1) (see Fig. 1) and

$$ F(u) := {\int}_{ -1}^{1} (\tanh(10 v-1)-u)^{2} \mathrm{d} v. $$
(4.3)

(see Fig. 4).

Fig. 4 The analytic function F (blue) given in (4.3) and the function F discretized with 4 (red) and 8 (yellow) equidistant discretization points

To be able to apply the SAG algorithm, we approximate the integral by a midpoint rule in both cases. For this, we use an equidistant grid, i.e., for \(N\in \mathbb {N}\) and v0 := −1, we define \(v_{k} := v_{0} + kh\), \(k=1,\dots ,N\), with \(h:= \frac {|{\textit {V}_{\text {ad}}}|}{N}=\frac {2}{N}\). In this way, F is approximated as follows:

$$ F(u) = \sum\limits_{i=1}^{N} {\int}_{v_{i-1}}^{v_{i}} f\left( u,v\right) \mathrm{d}v \approx \frac{2}{N}\sum\limits_{i=1}^{N} f\left( u,v_{i-\frac{1}{2}}\right). $$

The optimization problem considered in the SAG case thus reads as follows:

$$ \underset{u \in U_{\text{ad}}}{\min} \frac{2}{N}\sum\limits_{i=1}^{N} f\left( u,v_{i-\frac{1}{2}}\right). $$
(4.4)

The approximation error directly depends on the second derivative of f w.r.t. the second argument and on the grid spacing h, that is, on the number of intervals. A finer grid thus leads to a better approximation, but, for a deterministic gradient descent method, also to a high number of problems to solve in each iteration.
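For reference, a minimal sketch of the discretized objective (4.4) that the SAGN variants actually minimize (a vectorized evaluation of f is assumed):

```python
import numpy as np

def sag_n_objective(f, N, V=(-1.0, 1.0)):
    """The discretized objective (4.4): f evaluated at the N midpoints
    v_{i-1/2} of an equidistant grid on V_ad, weighted by h = |V_ad| / N."""
    a, b = V
    h = (b - a) / N
    v_mid = a + h * (np.arange(1, N + 1) - 0.5)
    return lambda u: h * float(np.sum(f(u, v_mid)))

fobj = lambda u, v: v**2 / ((u - v)**2 + 1e-3)
F4 = sag_n_objective(fobj, 4)    # the surrogate that SAG_4 minimizes
print(F4(0.0), F4(1.0 / 3.0))    # objective values at two candidate designs
```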

The comparison of the algorithms is based on the number of function evaluations, as these constitute the time-consuming steps for complex optimization tasks. We compare two different settings, one with a stepsize proportional to n− 0.6 (see left column in Figs. 5 and 6) and one with an appropriately chosen constant stepsize (see right column in Figs. 5 and 6). The shaded areas in Figs. 5 and 6 denote the quantile sets as defined in (4.1) for the 10 and 90% quantiles (light), the 20 and 80% as well as the 30 and 70% quantiles (medium dark), and the 40 and 60% quantiles (dark). The quantiles are based on \(10^{5}\) optimization runs, each with randomized initial datum. The black lines in Figs. 5 and 6 identify the median for CSG and SG. For SAG4 and SAG8, they identify the median of the series converging to one of the local minima. In addition, the line thickness and the patch opacity are proportional to the probability of converging to the respective local minimum.

Fig. 5 Comparison of CSG (top row, green), SG (second row, blue), SAG4 (third row, purple), and SAG8 (bottom row, red) in the case of objective functional (4.2) with stepsize \(\tau_{n} = 10^{-1} n^{-0.6}\) (left column) as well as constant stepsize \(\tau_{n} = 10^{-4}\) (right column) for 256 optimization steps each. The shaded areas denote from light to dark the quantile sets P0.1,0.9, P0.2,0.8, P0.3,0.7, and P0.4,0.6 as defined in (4.1). The dashed magenta line denotes the optimal value of F. It is noted that in the case of SAG4 and SAG8, convergence towards different local minima introduced by the discretization is observed

Fig. 6 Comparison of CSG (top row, green), SG (second row, blue), SAG4 (third row, purple), and SAG8 (bottom row, red) in the case of objective functional (4.3), stepsize \(\tau_{n} = n^{-0.6}\) (left column) as well as constant stepsize \(\tau_{n} = 10^{-2}\) (right column) for 64 optimization steps each. Note that the stationary point for SAG differs from SG and CSG due to the discretization of the objective functional. The shaded areas indicate, from light to dark, the quantile sets P0.1,0.9, P0.2,0.8, P0.3,0.7, and P0.4,0.6 as defined in (4.1). The dashed magenta line denotes the optimal value of F. The artificial-looking jumps in the area plots are due to the non-symmetry of the objective functional

Results of numerical experiments

In case of objective functional (4.2) optimized by SAG4 and SAG8, the algorithm converges to one of the three or five local minima of the discretized objective functional, respectively (see Fig. 1 with N = 4 and N = 8). In contrast to SAG, SG (in the case of decreasing stepsizes, see Fig. 5, left) and CSG (in both considered cases, see Figs. 5 and 6) converge to the optimal value of F. The “failure” of SAG is due to the fact that SAG only approximates the original objective functional. On the other hand, it is clear that for a sufficiently regular function (see Assumption 1), the optimal solutions of SAGN converge to the optimal solution of the original problem for \(N \rightarrow \infty \). However, a sufficiently large number N is in general not known a priori. Thus, even for a large N, it is not clear how (local) optimal solutions u∗ as well as their objective functional values F(u∗) are affected by the discretization of the integral. Moreover, the convergence of the SAG method becomes slower with larger N, as can be clearly seen from the more diffuse quantile sets in Figs. 5 and 6.

Figures 5 and 6 clearly show the advantage of CSG: it converges considerably faster than SG and does not converge to an artificial local minimum as SAG does. While SG also approaches the optimal value, at least in the case with decreasing stepsize, its speed of convergence is considerably lower compared to CSG. The true potential of CSG comes into play whenever constant stepsizes τn are chosen. This can be seen in the right columns of Figs. 5 and 6. It should be noted that the stepsize could be adapted individually for each method in order to achieve a slightly better convergence behavior. In particular, the constant stepsize we have chosen uniformly for all methods seems to be too large for SG. On the other hand, CSG can easily handle this stepsize.

Finally, it is observed that CSG combines the advantages of SAG and SG. For instance, in the left column in Fig. 6, SAG4 converges quickly but unfortunately to the “wrong” result, while SG converges to the correct limit point, but convergence is very slow. In contrast, CSG converges quickly to the correct limit point.

Another advantage of CSG is that, according to Theorem 20 and Corollary 22, and as discussed in Remark 23, the CSG algorithm can be stopped whenever the objective function is approximated with a defined accuracy and the first-order optimality conditions are satisfied within a given error tolerance. This is, in general, neither possible for SG nor for SAG.

4.2 Structural optimization example

As a design optimization test case, we have chosen the optimization of a 2D tire, fixed at the midpoint and loaded from an arbitrary direction. A weighted sum of the expected compliance, a volume penalization term, and a regularization term forms the objective functional.

In detail, we consider the design domain \({\varOmega } := \{ x\in {\mathbb {R}}^{2} \mid 0.1 < \|x\| < 1 \}\) (an annulus), which is subject to a load described by the function

$$ g(x) = h_{\alpha}(\text{arctan2}(x_{1},x_{2})) n(x) $$
(4.5)

on the outer boundary \({\varGamma }_{N} \subset \partial {\varOmega }\), and fixed with a homogeneous Dirichlet condition on the inner boundary \({\varGamma }_{D} \subset \partial {\varOmega }\) (see Fig. 7). Here, n(x) denotes the outer normal vector at \(x\in \partial {\varOmega }\), and \(h_{\alpha } : {\mathbb {R}} \to {\mathbb {R}}\) denotes for \(\alpha \in {\mathbb {R}}\) the scalar function:

$$ h_{\alpha}(\beta) = 1+\tanh \left( 10^{3}\left( \cos\left( \beta-\alpha \right)-1\right)+10^{-1} \right). $$
(4.6)
Fig. 7 Design setting for optimization problem (4.8). The normal force defined in (4.5) acts on the Neumann boundary ΓN; ΓD is a homogeneous Dirichlet boundary, and Ω is subdivided into the green region, which is subject to optimization, and a yellow region \(\hat {\varOmega }\) with ρ = 1

The angle α describes the position where the boundary force takes its maximum value and the function can be seen as a smoothed Dirac force on the boundary (see Fig. 8).

Fig. 8 Plot of hα as defined in (4.6) for \(\alpha = \frac {\pi }{2}\), showing the magnitude of the normal force g defined in (4.5) acting on ΓN of the domain visualized in Fig. 7

In Ω, material properties are defined using a pseudo density function ρ, which is used to scale a given isotropic material characterized by the Lamé parameters λ and μ. Denoting this material by E, the resulting material function is given by the SIMP law \(\rho^{p} E\), with a penalty parameter p > 1, see Bendsøe (1989). We assume that the material properties are fixed close to the outer boundary, and thus the pseudo density ρ is set to 1 in this part of the domain, i.e., \(\rho |_{\hat {\varOmega }} \equiv 1\) with \(\hat {\varOmega } := \{ x\in {\mathbb {R}}^{2} \mid 0.9 < \|x\| < 1 \}\) (see Fig. 7, yellow marked area). In the rest of the domain, ρ serves as design variable and is allowed to vary between a small positive value ε and 1.

Now, for each admissible design ρ and each α ∈ [0,2π], a linear elasticity problem, the so-called state problem, is defined on Ω, applying the boundary conditions described above. The corresponding state solution is denoted by u(ρ,α).

The optimization goal is to minimize the expected compliance over the angles α ∈ [0,2π]. In addition, we introduce a term penalizing the total material consumption and a filter regularization term as proposed in Semmler et al. (2018, Section 3.2). This leads to the following objective functional:

$$ {\int}_{ 0}^{2\pi} J(\rho, u(\rho,\alpha)) \mathrm{d}\alpha, $$

where

$$ \begin{array}{@{}rcl@{}} J(\rho, u) &:=& \gamma_{0}{\int}_{{\varGamma}_{N}} u(x)\cdot g(x) \mathrm{d} x \\ &&+\gamma_{v} {\int}_{{\varOmega}} \rho \mathrm{d} x +\gamma_{\varphi} {\int}_{{\varOmega}} \left( \rho - \frac{\rho * \varphi}{1 * \varphi } \right)^{2} \mathrm{d} x, \end{array} $$
(4.7)

γ0, γv, γφ > 0 are given scaling parameters, ∗ denotes the convolution operator in \({\mathbb {R}}^{2}\), and the filter kernel \(\varphi :{\mathbb {R}}^{2} \to {\mathbb {R}}\) is defined by the following:

$$ \varphi(x) := \max\{0,r_{0} - \|x\|\} $$

with radius r0 > 0. By finite element discretization, this leads to the following optimization problem:

$$ \begin{array}{@{}rcl@{}} &&\underset{\rho_{h} \in [\varepsilon,1]^{M}}{\min} {\int}_{0}^{2\pi} j_{h}(\rho_{h}, u_{h}(\rho_{h},\alpha)) \mathrm{d}\alpha,\\ &&s.t. \quad K(\rho_{h}) u_{h}(\rho_{h},\alpha) = g_{h}(\alpha), \ \alpha \in [0,2\pi]. \end{array} $$
(4.8)

Here, M is the number of design variables, \(K(\rho _{h}) \in {\mathbb {R}}^{N\times N}\) denotes the global stiffness matrix with N degrees of freedom, and \( g_{h}(\alpha )\in {\mathbb {R}}^{N}\) the right-hand side of the linear elastic state problem with the load centered at angle α in finite element notation. Moreover, jh is the discretized equivalent of J. In the following example, the parameters are chosen as follows: γ0 = 1, γv = 0.1, γφ = 1, r0 = 0.05, and SIMP parameter p = 3 (see, e.g., Bendsøe (1989)). As material parameters, we have chosen λ = μ = 1; with \(\varepsilon = 10^{-2}\) and p = 3, the resulting void stiffness is \(10^{-6}\).
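As an aside, the filter regularization term in (4.7) can be sketched on a uniform grid as follows; this is only an illustration, since the actual implementation operates on the unstructured finite element mesh described below:

```python
import numpy as np
from scipy.signal import fftconvolve

def filter_term(rho, r0, h):
    """Value of the filter regularization term in (4.7) on a uniform grid
    with spacing h: the integral of (rho - (rho*phi)/(1*phi))^2, with the
    hat-shaped kernel phi(x) = max(0, r0 - ||x||) and * the convolution."""
    m = int(np.ceil(r0 / h))
    x = h * np.arange(-m, m + 1)
    X, Y = np.meshgrid(x, x, indexing="ij")
    phi = np.maximum(0.0, r0 - np.hypot(X, Y))
    num = fftconvolve(rho, phi, mode="same")                 # rho * phi
    den = fftconvolve(np.ones_like(rho), phi, mode="same")   # 1 * phi
    return float(np.sum((rho - num / den) ** 2)) * h**2
```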

Computational optimization results

As in the academic examples presented in Section 4.1, we show and compare results for SG, SAG, and CSG. All optimization runs have been started with \(\rho \equiv \frac {1}{2}\) in the design domain \({\varOmega } \setminus \hat {\varOmega }\). The linear elasticity problem is discretized using \(\approx 40\cdot 10^{3}\) unstructured triangular elements generated by Triangle (see Shewchuk (1996)). Using first-order Lagrange basis functions, this discretization results in approximately the same number of degrees of freedom in terms of uh. The design domain comprises roughly \(20\cdot 10^{3}\) degrees of freedom in terms of ρh. The implementation is performed in Matlab, and the linear systems are solved using the direct solver available through the backslash operator.

Analogous to the previous experiments, we have chosen a suitable initial stepsize and have discretized the integral in (4.8) for SAG using a trapezoidal rule. Again, the SG method is applied to its undiscretized version to omit dependencies on the choice and accuracy of the quadrature rule. For validation and comparison purposes, the objective function is further approximated with the trapezoidal rule using a total of 180 equidistant discretization points to ensure a good approximation of the expected compliance.

In Fig. 9, the distribution of obtained objective functional values for 480 optimization runs with 2048 steps each is compared for the different methods with appropriately chosen stepsize rules. The presented results clearly show the fast convergence of CSG in comparison to the other schemes.

Fig. 9 Comparison of CSG (green), SG (blue), SAG8 (purple), and SAG16 (red) in the case of objective functional (4.7) with \(\tau_{n} = 5\cdot 10^{3} n^{-0.6}\) (top) as well as constant stepsize τn = 750 (bottom) and 2048 optimization steps each. Note that the stationary point for SAG differs from SG and CSG due to the discretization of the objective functional. The shaded areas denote the quantile sets P0.1,0.9 (light) and P0.25,0.75 (dark) as defined in (4.1). This plot clearly shows the advantage of CSG in terms of speed of convergence as well as resulting objective functional value

Moreover, in Fig. 10, a rapid convergence of the increasingly better approximated objective function values is observed when applying the CSG method to the structural optimization problem. It is noted that, by construction, this type of convergence can neither be expected for the SG nor for the SAG method.

Fig. 10 Relative error between the objective functional approximated by CSG and the objective functional approximated by 180 load cases in the case of the functional stated in (4.7) and constant stepsize. It shows a clear convergence trend of the approximated objective functional value \(\hat F\) (see Algorithm 1) to the true objective functional value F, as predicted by Corollary 22

In Fig. 11, the so-called physical density ρp is shown for SAG, SG, and CSG for constant stepsize. While SG does not converge within 2048 iterations, SAG converges, though the resulting density distribution is strongly influenced by the discretization of the objective functional: note the 8 struts corresponding to the 8 discrete load cases applied (see the middle column of Fig. 11). No such effect is observed in the case of the CSG result (right column of Fig. 11). Moreover, the CSG result appears much clearer compared to the SG result, which is due to faster convergence.

Fig. 11 Comparison of the solution ρp for SG (left column), SAG8 (middle column), and CSG (right column). The rows show, from top to bottom, the solution of one optimization run after 128, 256, 512, 1024, and 2048 optimization steps in the case of objective functional (4.7) and the optimization problem defined in (4.8) with suitable constant stepsize. In the middle column, one can clearly see the eight load cases applied in the discretized objective functional of SAG8. For each optimization method, the run with the lowest resulting objective functional was chosen

It is finally noted that, in the interest of comparability, we have not applied any continuation scheme (e.g., driving the SIMP parameter \(p\rightarrow \infty \)), which would result in a true “black and white” solution, i.e., \(\rho (x) \in \{0,1\} \ \forall x\in {\varOmega } \setminus \hat {\varOmega }\). Such a scheme is of course necessary to achieve a physically interpretable and manufacturable solution. We leave the question of defining a continuation scheme suitable for CSG as a subject for further research.

5 Conclusion and outlook

In this work, we introduced the continuous stochastic gradient (CSG) method, which is applicable to a broad class of structural optimization problems. Preliminary experiments with challenging academic examples, as well as an application from mechanics in which an elastic structure was optimized with respect to infinitely many load cases, revealed that the CSG method performed better than both the traditional SG method and its relative SAG, in the sense that a significantly lower function value could be obtained within a defined number of iterations. This is particularly interesting as the CSG algorithm requires roughly the same computational effort per optimization iteration as the SG and the SAG method. Moreover, like the SAG method, it benefits from gradient information obtained in previous iterations. Importantly, the CSG method does not require an a priori discretization of integrals entering the objective function, and the function value and gradient can be approximated with arbitrary precision throughout the optimization iterations. As a result, the CSG method approaches a full gradient method over the course of the iterations.

While the CSG method appears promising, more examples, e.g., from robust optimization, acoustics, or optics, should be tested in the future to obtain a full picture of its behavior in practical applications.

Furthermore, from a theoretical point of view, an analysis covering the convergence rate for convex functions would be helpful to provide a deeper understanding of the algorithm. It should also be noted that the computational effort of computing the gradient weights \({{\alpha }_{0}^{n}},\dots ,{{\alpha }_{n}^{n}}\) grows with each iteration. To compensate for this, an efficient implementation of the weight computation is crucial. While this can easily be done in the case of a one-dimensional index set Vad, the question remains how this can be achieved in higher dimensions. Other interesting questions center around how Lipschitz constants can be estimated throughout the optimization process and how the stepsize can be automatically adapted.

Finally, similarly as suggested in De et al. (2019), a combination with established structural optimization algorithms, for instance, GCMMA, is conceivable.