CSG: A new stochastic gradient method for the efficient solution of structural optimization problems with infinitely many states

This paper presents a novel method for the solution of a particular class of structural optimization problems: the continuous stochastic gradient method (CSG). In the simplest case, we assume that the objective function is given as an integral of a desired property over a continuous parameter set. The application of a quadrature rule for the approximation of the integral can give rise to artificial and undesired local minima. The CSG method, however, does not rely on an approximation of the integral; instead, it utilizes gradient approximations from previous iterations in an optimal way. Although the CSG method does not require more than the solution of one state problem (of infinitely many) per optimization iteration, it is possible to prove in a mathematically rigorous way that the function value as well as the full gradient of the objective can be approximated with arbitrary precision in the course of the optimization process. Moreover, numerical experiments for a linear elastic problem with infinitely many load cases are described. For the chosen example, the CSG method proves to be clearly superior to the classic stochastic gradient (SG) and the stochastic average gradient (SAG) methods.


Introduction and problem statement
In the following, we denote the set of Lebesgue integrable functions mapping from the space X to the space Y by L^1(X; Y) and from the space X to the real numbers R by L^1(X). The "dot" notation is used in the following way: g(·, y) denotes the mapping x → g(x, y).
Our goal in this article is to develop a novel stochastic gradient method for the efficient solution of optimization problems of the general form: min_{u ∈ U_ad} F(u) = J(f(u, ·)). (1.1) Here, u is the design variable, which can be subject to constraints described by the set U_ad, and F: U_ad → R is given as a composition of a functional J: L^1(V_ad) → R and a function f: U_ad × V_ad → R, where V_ad is a continuous parameter set. Throughout this paper, we further assume that the evaluation of the function f for any (u, v) ∈ U_ad × V_ad requires the solution of an underlying state problem, i.e., f is given in the following form: f(u, v) = j(u, y(u; v)), where y(u; v) denotes the solution of the state problem parameterized by the design u and the additional continuous index variable v ∈ V_ad. As a consequence of this construction, an evaluation of the function F at a given design u theoretically requires the solution of infinitely many state problems.
In order to demonstrate that problem (1.1) has a broad range of applications, we give two particular examples for the choice of the functional J. In our first example, J is simply an integral over V_ad, resulting in the following problem:

min_{u ∈ U_ad} F(u) = ∫_{V_ad} f(u, v) dv. (1.2)

The structure in (1.2) can arise in various settings. First, in an elastic setting, v could be a continuous load index and f(·, v) a compliance, displacement, or stress evaluation associated with the solution of the state problem with load index v ∈ V_ad. In this case, (1.2) would be a structural optimization problem with infinitely many load cases. For optimization problems with many load cases, we refer, e.g., to Alvarez and Carrasco (2005) and Zhang et al. (2017), as well as the references therein. Furthermore, if, for applications in acoustics (see, e.g., Dilgen et al. 2019) or optics (see, e.g., Jensen and Sigmund 2011), a state problem in time-harmonic form is considered, the parameter v can play the role of a frequency or wavelength.
A prominent example for f in this context would be an L^2-tracking functional, such that (1.2) would describe the design of a device with a prescribed behavior over a continuous frequency range, such as the range of visible light, see Semmler et al. (2015). Another application in optics could be the optimization of an anisotropic object with respect to arbitrary illumination directions, again see Semmler et al. (2015). While we have so far looked at all these examples from a deterministic point of view, another important class of applications arises if we interpret F(u) as the expected value of a given property f associated with a design u. This immediately leads to the notion of reliability-based optimization (RBO) problems; see, e.g., the overview article by Maute and Frangopol (2003), as well as Conti et al. (2009) or De et al. (2020) and the references therein for a collection of more recent articles on this topic. In all these cases, the parameter v constitutes a realization of uncertainty (from a continuous uncertainty set V_ad), where the source of uncertainty can be, e.g., in loading, material properties, or stiffness.
Following this line of argumentation a little further, we come to the second instantiation of the generic problem (1.1), which is referred to as a robust structural optimization problem according to De et al. (2020) and takes the following form: min_{u ∈ U_ad} E[f(u, ·)] + λ Var(f(u, ·)). (1.3) Here, the expected value E[f(u, ·)] is computed by the integral in (1.2), and λ is a positive parameter that weights the importance of variations.
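To make the robust objective (1.3) concrete, the following is a minimal Monte Carlo sketch of its evaluation for a toy property function; the function names, the sample problem f(u, v) = (u − v)^2, and the uniform distribution on V_ad are our own assumptions, not taken from the paper:

```python
import numpy as np

def robust_objective(f, u, v_samples, lam):
    """Sample estimate of E[f(u, .)] + lam * Var(f(u, .)) over scenarios from V_ad."""
    vals = np.array([f(u, v) for v in v_samples])
    return vals.mean() + lam * vals.var()

# Toy property function f(u, v) = (u - v)^2 with v uniform on V_ad = (-1, 1);
# for u = 0 the exact expected value is 1/3.
rng = np.random.default_rng(0)
v_samples = rng.uniform(-1.0, 1.0, size=200_000)
val = robust_objective(lambda u, v: (u - v) ** 2, 0.0, v_samples, lam=0.5)
```

Note that in the setting of the paper each evaluation of f hides a state-problem solve, so such a brute-force sampling of the expectation is exactly what the CSG method is designed to avoid.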
Having said this, we would like to emphasize that our paper is not the first one suggesting the application of stochastic-gradient-type methods for the solution of structural optimization problems of the aforementioned type. Zhang et al. (2017) use the classical stochastic gradient (SG) method for the efficient solution of structural optimization with many (but finitely many) load cases. In De et al. (2020), robust optimization problems of the form (1.3) are solved by the SG method and variants, as well as a stochastic version of the well-known GCMMA framework, see Svanberg (2002). However, in that work, a discrete set of scenarios is also assumed from the very beginning. More applications of the SG method can be found in the closely related area of inverse problems, see Haber et al. (2012) and Roosta-Khorasani et al. (2014). These are structurally similar to the problems considered in this article, in the sense that each evaluation of the function f for a given scenario v likewise requires the potentially expensive solution of a discretized partial differential equation (PDE).
However, there are at least two substantial differences between all these references and the approach we suggest in this paper. Firstly, even though in some cases in the aforementioned articles a continuous index set V ad is considered, an a priori selection of scenarios (or discretization of V ad ) is used. Secondly, even though in many applications the property function f depends on the index variable v with a certain regularity, this structure is ignored. In sharp contrast to this, we avoid an a priori discretization of integrals in this paper. This is principally due to the fact that a too coarse discretization can lead to artificial minima as will be demonstrated by means of a simple example in Section 2. Moreover, we would like to exploit the natural regularity of the property function f with respect to the index parameter v in order to design an efficient optimization algorithm in which the objective function F and its gradient are increasingly better approximated.
Beyond stochastic descent methods, robust problems of type (1.3) can also be approached by a combination of stochastic collocation methods combined with deterministic optimization solvers, see, e.g., Lazarov et al. (2012). However, also in this case, through the collocation, an approximation of the objective functional based on finitely many scenarios is chosen a priori.
To the best knowledge of the authors, the SG method itself is the only method which can, in principle, be applied to structural optimization problems with infinitely many state problems without relying on an a priori approximation of the objective functional. However, as will be shown in this article, only the CSG method is able to successfully solve these problems.
Despite this, there are substantial structural similarities between our CSG method and the classic SG method and its relatives. Therefore, in the following, we briefly describe the basic SG idea.
The original SG method, see Robbins and Monro (1951), is frequently used to minimize functions of the form F: R^p → R, F(u) = (1/n) Σ_{i=1}^{n} f_i(u), with f_i: R^p → R for all i ∈ {1, ..., n}. Whereas conventional gradient methods calculate all n gradients ∇f_i in each iteration, the stochastic gradient method uses only a small random selection thereof. Hereby, for large n (as, e.g., in machine learning applications, see Bottou and Cun 2004), the computational effort can be drastically reduced in comparison to the classical gradient method. Schmidt et al. (2017) introduced an improved version of the SG algorithm, the so-called stochastic average gradient (SAG) method; see also the survey by Bottou et al. (2018). It benefits from previously calculated gradients, leading to a better approximation of the exact gradient. Thus, a better convergence behavior is typically observed. In this paper, the following basic properties of the SG and the SAG method will be combined:
- Low computational effort per iteration
- Reuse of previously obtained information

We will integrate these two properties in our CSG algorithm and compare it with the SG and the SAG method by means of examples. The remainder of this paper is structured as follows: We will first close this section with the formal statement of the problem. In Section 2, we introduce the proposed CSG algorithm by which we aim to solve the types of problems discussed. In Section 3, we then analytically study the convergence of the algorithm and state assumptions necessary for convergence. The main theoretical results are given in the two Theorems 19 and 20. Section 4 provides first numerical results of the CSG algorithm, as well as a comparison with the SG and SAG methods. Finally, in Section 5, we provide a summary and a brief outlook for further scientific work.
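The two bullet points can be illustrated with a minimal sketch of the SG and SAG updates for a finite-sum toy objective; the quadratic components f_i(u) = (u − v_i)^2 / 2 and all parameter choices here are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
v = np.linspace(-1.0, 1.0, 8)                  # finite scenario set v_1, ..., v_n
grad_i = lambda u, i: u - v[i]                 # gradient of f_i(u) = (u - v_i)^2 / 2

# SG: low effort per iteration -- one random component gradient, decreasing steps.
u_sg = 2.0
for n in range(1, 2001):
    i = rng.integers(len(v))
    u_sg -= n ** -0.6 * grad_i(u_sg, i)

# SAG: additionally reuses previous information -- one stored gradient is
# refreshed per iteration, and the step uses the average of all stored gradients.
u_sag = 2.0
memory = np.array([grad_i(u_sag, i) for i in range(len(v))])
for n in range(2000):
    i = rng.integers(len(v))
    memory[i] = grad_i(u_sag, i)
    u_sag -= 0.5 * memory.mean()               # a constant stepsize is viable here

# Both iterates approach the minimizer mean(v) = 0 of F(u) = (1/n) sum_i f_i(u).
```

The gradient memory is what allows SAG to use a constant stepsize, while plain SG needs decreasing steps to average out the sampling noise; CSG inherits the memory idea but, as described below, weights the stored gradients instead of averaging them uniformly.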

Formal statement of the problem
Generally, we look at objective functionals as defined in (1.1), and we further assume the following: Assumption 1 (Objective functional) The reduced objective functional F: U_ad → R is given by the composition of a mapping J: L^1(V_ad; R) → R and a function f: U_ad × V_ad → R. The Fréchet derivative of J will be denoted by its associated function D_J: L^∞(V_ad; R) × V_ad → R. Both D_J and ∇_u f have to be Lipschitz continuous with respect to both of their arguments in the respective topologies.
For a definition of the Fréchet differential, see for instance (Jahn 2007, Def. 3.11). With F chosen as in (1.1) and the latter assumption, we can state its derivative as follows: Remark 2 (Derivative of F) The derivative of F is then given by

∇F(u) = ∫_{V_ad} D_J(f(u, ·), v) ∇_u f(u, v) dv for all u ∈ U_ad, (1.4)

with D_J being the Fréchet differential as defined in Assumption 1.
The Fréchet derivative mentioned is exemplified in two relevant cases:

Remark 3 (Examples for Fréchet derivatives)
If J(f(u, ·)) is the expected value of f with respect to the second argument (e.g., for v uniformly distributed on V_ad), we obtain D_J(f(u, ·), v) = 1/|V_ad|; for J(f(u, ·)) being the variance of f with respect to the second argument, we obtain, for all v ∈ V_ad, D_J(f(u, ·), v) = (2/|V_ad|) (f(u, v) − E[f(u, ·)]). As already mentioned, in the case of structural optimization, for each parameter v ∈ V_ad, the evaluation of f(·, v) requires the solution of an associated state problem. Consequently, the (approximate) evaluation of F and its gradient given in (1.4) is computationally very expensive. In general, it can be stated that the algorithm introduced in Section 2 is especially attractive for problems in which F is numerically expensive to evaluate due to this functional dependence.
One advantage of our algorithm in comparison to the SG and the SAG method is that no (a priori) quadrature rule is used to approximate the integral in (1.4). In this way, we will later obtain convergence of a subsequence to a stationary point of the continuous problem (1.1). Moreover, this helps to avoid artificial local minima, as will be outlined in the next section.

Continuous stochastic gradient method
Before presenting the new optimization algorithm, we will briefly discuss how discretization of the objective functional can introduce local minima, looking at the function F given in (2.1). By choosing U_ad = [−1/2, 1/2] and V_ad = (−1, 1), this corresponds to problem (1.1).
In Fig. 1, the graph of the function F in (2.1) is shown along with numerical approximations based on equidistant discretizations of the integral. Although the original function is convex, local minima are introduced by the discretization of the integral. It is noted that, without much information on the function f, it is hard to choose a suitable discretization that avoids this effect. Nevertheless, a good algorithm should be able to prevent convergence to such an artificial local minimum. In fact, this is one of the key features of the CSG algorithm introduced in detail in the following section.
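The effect can be reproduced numerically. Since (2.1) is not restated in this excerpt, the sketch below uses a hypothetical integrand with the same qualitative behavior: the exact objective F(u) = ∫_{−1}^{1} f(u, v) dv = 2u^2 is convex, but the oscillatory term aliases on a coarse equidistant quadrature grid and creates spurious local minima:

```python
import numpy as np

# Hypothetical stand-in for (2.1): the cos(4*pi*v) factor integrates to zero
# exactly, so F(u) = 2*u**2, but it aliases on a 4-interval equidistant grid.
f = lambda u, v: u**2 + np.sin(5.0 * u) * np.cos(4.0 * np.pi * v)

def F_discretized(u, n_intervals):
    """Trapezoidal-rule approximation of the integral over V_ad = (-1, 1)."""
    v = np.linspace(-1.0, 1.0, n_intervals + 1)
    w = np.full(n_intervals + 1, 2.0 / n_intervals)
    w[0] *= 0.5
    w[-1] *= 0.5
    return float(np.sum(w * f(u, v)))

def count_local_minima(values):
    """Count local minima of a sampled 1D function, boundary points included."""
    padded = np.r_[np.inf, values, np.inf]
    return int(np.sum((padded[1:-1] < padded[:-2]) & (padded[1:-1] < padded[2:])))

u_grid = np.linspace(-0.5, 0.5, 1001)          # U_ad = [-1/2, 1/2]
F_exact = 2.0 * u_grid**2
F_coarse = np.array([F_discretized(u, 4) for u in u_grid])
# F_exact has a single minimum at u = 0; F_coarse acquires artificial local
# minima, and its global minimizer is far away from the true one.
```

A deterministic descent method applied to F_coarse can thus converge to a point that is not even close to stationary for the continuous problem.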

The CSG algorithm
With P_{U_ad} being the orthogonal projection onto U_ad, λ_{d_v} being the Lebesgue measure in the d_v-dimensional space, and the sets of points that are closer to (u_i, ω_i) than to any other (u_j, ω_j) in the ‖·‖_*-norm given in Definition 10 (the Voronoi cells used below), we can state the proposed Algorithm 1.

Fig. 1 The analytic function F (blue) and the function F discretized with 4 (red), 8 (yellow), 16 (green), and 32 (purple) equidistant discretization points. F is given in (2.1)

The CSG method as defined in Algorithm 1 is suitable for problems of the form (1.1) and is structured like most gradient descent methods. In each iteration n, a search direction Ĝ_n (an approximation of the gradient ∇F_n := ∇F(u_n)) is calculated, a step length τ_n > 0 is chosen, and the sequence (u_n)_{n∈N} is generated by

u_{n+1} = P_{U_ad}(u_n − τ_n Ĝ_n).

Note that the existence and uniqueness of P_{U_ad} is guaranteed by the projection theorem (see, e.g., Aubin (2000)), P_{U_ad} being defined for all w ∈ R^{d_u} by

P_{U_ad}(w) := argmin_{u ∈ U_ad} ‖u − w‖.

In this contribution, we use the abbreviation F_n := F(u_n), and ‖·‖ denotes the Euclidean norm in the respective dimension. The distinctive feature of the algorithm lies in the calculation of the search direction Ĝ_n. In each iteration n, the gradient ∇_u f(u_n, ·) is evaluated at a random position ω_n ∈ V_ad and stored for later iterations. The search direction Ĝ_n is in principle a linear combination of the former gradients g_i := ∇_u f(u_i, ω_i), i = 0, ..., n, with weights α_i^n, i = 0, ..., n. To provide an idea of how the weights are calculated, we refer to the sketch in Fig. 2. There, ω_0, ..., ω_10 ∈ V_ad are randomly sampled points and g_0, ..., g_10 the corresponding gradients. The approximate gradient is then given as Ĝ_10 = Σ_{k=0}^{10} α_k^{10} g_k, where α_0^{10}, ..., α_10^{10} are the lengths of the line segments associated with the points (ω_0, u_0), ..., (ω_10, u_10). Here, the assignment of segments to points is indicated by matching colors.
The underlying structure is the Voronoi diagram of the points (ω_k, u_k)_{k∈{0,...,10}} (see, e.g., Fortune (1995)). More formally, the weight α_k^{10} can be defined for all k ∈ {0, ..., 10} as the measure of the Voronoi cell associated with (ω_k, u_k).

Fig. 2 Example for the calculation of the approximated gradient Ĝ_n in Algorithm 1 in the case of V_ad, U_ad ⊂ R

We note that the computational complexity per iteration is given by the evaluation of the gradient of the function f and the calculation of the weights α_0^n, ..., α_n^n. Up to the calculation of the weights, this is analogous to the SG and SAG methods. It should also be noted that all gradients g_0, ..., g_{n−1} from the previous iterations are included in the current iteration. In Section 3, we will show that the error ‖Ĝ_n − ∇F_n‖ almost surely converges to zero. Hence, the algorithm does not become trapped in a local minimum of a discretized function; the problem described in Fig. 1 does not arise, since the approximation of ∇F_n by Ĝ_n becomes better and better.
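A compact numerical sketch of this weight computation for scalar u and V_ad = (−1, 1) follows. The choice of product norm ‖(u, v)‖_* = |u| + |v|, the grid approximation of the Voronoi cell measures, and the toy integrand are our own assumptions, since Algorithm 1 and Definition 10 are not restated in this excerpt:

```python
import numpy as np

def csg_weights(u_n, u_hist, w_hist, v_grid):
    """Approximate the weights alpha_i^n: the measure of the set of v in V_ad
    for which (u_n, v) is closest to the past sample (u_i, omega_i) in an
    assumed product norm |u| + |v|. The measure is approximated on a fine grid."""
    u_hist = np.asarray(u_hist)
    w_hist = np.asarray(w_hist)
    dist = np.abs(u_n - u_hist)[None, :] + np.abs(v_grid[:, None] - w_hist[None, :])
    nearest = np.argmin(dist, axis=1)          # winning past sample per grid point
    cell = (v_grid[-1] - v_grid[0]) / (len(v_grid) - 1)
    return np.bincount(nearest, minlength=len(u_hist)) * cell

def csg_step(u_n, tau, u_hist, w_hist, g_hist, v_grid, u_box=(-0.5, 0.5)):
    """One CSG-style iteration: Ghat_n = sum_i alpha_i^n g_i, then a projected
    gradient step onto the box U_ad."""
    alpha = csg_weights(u_n, u_hist, w_hist, v_grid)
    G_hat = float(alpha @ np.asarray(g_hist))
    return float(np.clip(u_n - tau * G_hat, *u_box))

# Assumed toy problem F(u) = int_{-1}^{1} (u - v)^2 dv, i.e. grad_u f = 2(u - v),
# with exact gradient 4u and minimizer u* = 0 on U_ad = [-1/2, 1/2].
rng = np.random.default_rng(2)
v_grid = np.linspace(-1.0, 1.0, 2001)
u, u_hist, w_hist, g_hist = 0.4, [], [], []
for n in range(200):
    omega = rng.uniform(-1.0, 1.0)             # one random index sample per step
    u_hist.append(u)
    w_hist.append(omega)
    g_hist.append(2.0 * (u - omega))           # one gradient evaluation per step
    u = csg_step(u, 0.25, u_hist, w_hist, g_hist, v_grid)
```

As the history fills V_ad, the weighted sum behaves like an increasingly accurate quadrature of the gradient integral, which is why a constant stepsize suffices here.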

Convergence analysis
In this section, we will study the convergence of the proposed algorithm. Due to the randomly chosen evaluation point within the algorithm, we will have to study probabilistic convergence behavior in terms of "almost sure convergence" and convergence in expectation. This notion of convergence as well as further assumptions and definitions are given in Section 3.1. In Section 3.2, we prove that the error in the gradient approximation goes to zero and finally apply this result in Section 3.3 to prove convergence of the CSG method.

Assumptions, definitions, and preliminary results
For the convergence analysis of Algorithm 1, the following assumptions on the objective functional, the step length, and the sets U_ad, V_ad are an important ingredient. In the following, we will assume that these assumptions are always satisfied without mentioning it explicitly.

Definition 4 (Lipschitz constants and maxima)
We will denote the Lipschitz constants of D_J and ∇_u f with respect to their first and second arguments by L_{D_J}^{(1)}, L_{D_J}^{(2)} and L_{∇_u f}^{(1)}, L_{∇_u f}^{(2)}, respectively. Their maximal absolute function values will be denoted by C_{D_J} and C_{∇_u f}.
For U_ad and V_ad, we assume the following (Assumption 5): The latter assumption is fulfilled for non-pathological open sets with finite perimeter.
Assumption 6 (Step length) The step length (τ_n)_{n∈N} satisfies the following: These conditions on the step length match the conditions stated in Robbins and Monro (1951, Eqns. (6) and (26)), as well as, equivalently, in Bottou et al. (2018, Eqn. (4.19)) in the one-dimensional case, and can be seen as a higher-dimensional equivalent.

Remark 7 (Step length for d_v = 1 and d_v = 2) In the case of a one- or two-dimensional set V_ad, Assumption 6 is satisfied iff τ_n ∈ o(n^{−1/2}) and τ_n ∈ Ω(n^{−1}), with the Big-Oh and Little-Oh notation as defined in Bürgisser and Cucker (2013). In other words, the null sequence (τ_n)_{n∈N} must not tend to zero faster than (n^{−1})_{n∈N}, but also not as slowly as (n^{−1/2})_{n∈N}.
The lower bound for the stepsizes ensures that the accumulated stepsizes reach infinity, so that the algorithm does not stall due to their reduction. The upper bound ensures that the approximation of the search direction is appropriate; this is related to the rate of convergence for empirical measures, see, e.g., Dudley (1969, Prop. 3.4).
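The two requirements can be checked numerically for a concrete choice τ_n = n^{−0.6}, an admissible exponent between the two bounds of Remark 7; the numerical check below is only illustrative:

```python
import numpy as np

def step_sums(s, n_max):
    """Partial sums of tau_n = n**(-s) and of tau_n**2 up to n_max."""
    n = np.arange(1, n_max + 1, dtype=float)
    return float(np.sum(n ** -s)), float(np.sum(n ** (-2.0 * s)))

# For s = 0.6: the step sum keeps growing (divergence), while the sum of
# squared steps levels off (convergence), as required.
s1, q1 = step_sums(0.6, 10**5)
s2, q2 = step_sums(0.6, 10**6)
```

Increasing n_max by a factor of 10 multiplies the first sum by roughly 10^{0.4} ≈ 2.5 while barely changing the second, which is the behavior the assumption demands.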
Notwithstanding these assumptions on the stepsize, a result will be stated in Theorem 21 for the case of a step length bounded away from zero.
To show convergence of the algorithm, we must first state the probability space setting.
Definition 8 (Probability space setup) The probability space (Ω, A, P) is given by the following: All the following random variables are defined in this setting. For the convergence of random variables, we use the following commonly used notation: Definition 9 (Stochastic convergence) A sequence of random variables (Z n ) n∈N converges almost surely to some random variable Z iff In this document, we define the following norm on the product space U ad × V ad .
Definition 10 (Norm in U_ad × V_ad) For better readability, we define the following product norm ‖·‖_* on the space U_ad × V_ad: Due to norm equivalence in finite-dimensional spaces, the results presented later hold for any chosen norms on U_ad and V_ad and combinations thereof.
The orthogonal projection used in Algorithm 1 has some important properties: Lemma 11 (Orthogonal projection) Let S ⊂ R^n for n ∈ N_{>0} be closed and convex and let P_S be the orthogonal projection; then the following holds for all x, y ∈ R^n and z ∈ S: For h ∈ C^1(U_ad) and U_ad convex, the following first-order optimality conditions are equivalent: Corollary 12 (Optimality conditions) For u* ∈ U_ad, the following statements are equivalent: We call a point u* satisfying one of the above conditions a stationary point.
Proof Define for u ∈ U ad the cone Using Lemma 11 ((i) for "⇒" and (ii) for "⇐"), it is straightforward to see that for x ∈ R d u and u ∈ U ad ,

Error in the approximate gradient
In this subsection, we analyze the error between the approximate gradient Ĝ_n in the nth iteration and the gradient of the objective functional ∇F_n. To do this, we define for v ∈ V_ad, ω ∈ Ω the sequence of random variables (X_n)_{n∈N}, where ε_n := 2 C_{∇_u f} c_2 |V_ad| n^{−δ/2} + ε̃_n and ε̃_n := n^{δ/2 − 1/max(2, d_v)}, with c_2 and δ defined in Assumption 6 and C_{∇_u f} in Definition 4. Moreover, V_ad^{ε_n} is defined in Assumption 5.
By the limit comparison test, the corresponding series converge. Finally, note that Assumption 5 gives that v ∈ V_ad^{ε_n} implies B_{ε_n}(v) ⊂ V_ad and therefore sup_{v ∈ V_ad^{ε_n}} P(X_n(·; v) ≥ ε_n) ≤ 1 − ν · n^{δ/2} a_n.
As a direct consequence of the latter result, we obtain almost sure convergence: Proof The result follows by Lemma 13 and the Borel–Cantelli lemma, see, e.g., Klenke (2013, Thm. 6.12).
Thus, due to the Lipschitz continuity of ∇_u f and D_J, the integral in ∇F(u_n) is increasingly better approximated by Ĝ_n for n → ∞: Corollary 15 (Error in gradient approximation) The norm of the difference between the approximate gradient Ĝ_n in the nth iteration (defined in Algorithm 1) and the gradient of the exact objective functional ∇F at u_n goes to zero, i.e., Then, where f̃(u_n, v) := f(u_{k_n(v)}, ω_{k_n(v)}). By the Lipschitz continuity assumed in Assumption 1, we therefore obtain the following, with constants defined in Definition 4 and X_n(·; v) as defined in Lemma 13. Recall that U_ad and V_ad are bounded. Now, the almost sure convergence, as well as the convergence of the expectations, follows by Lebesgue's dominated convergence theorem.
Finally, let C be a generic constant and ε > 0. Since sup_{v ∈ V_ad} X_n(·; v) ≤ D < ∞, where D := diam(V_ad) + diam(U_ad) denotes the diameter of V_ad plus the diameter of U_ad, and by Fubini's theorem, we have the following, where V_ad^r is given in Assumption 5. If we choose ε = ε_n = 2 C_{∇_u f} c_2 · n^{−1/2} + n^{δ/2 − 1/max(2, d_v)} as in Lemma 13, we obtain the following, which concludes the proof.

Convergence result
As we have seen in Corollary 15, the error ‖Ĝ_n − ∇F_n‖ converges almost surely and in expectation to zero for n → ∞. It remains to provide sufficient conditions under which the algorithm converges to a stationary point.

Lemma 16 (Objective functional values)
The difference of the objective functional values in iteration n ∈ N can be estimated as follows: with φ_n := τ_n ‖∇F_n − Ĝ_n‖ · ‖Ĝ_n‖ + τ_n^2 C ‖Ĝ_n‖^2, with a constant C ∈ R_{≥0} depending on the Lipschitz constants and suprema of the involved functions.
Proof By the mean value theorem, there is a c ∈ (0, 1) such that (setting ∇F_n^c := ∇F((1 − c) u_n + c u_{n+1})), using the Cauchy–Schwarz inequality and Definition 4. Recall that u_{n+1} = P_{U_ad}(u_n − τ_n Ĝ_n). With Lemma 11 (b), (c), and the Cauchy–Schwarz inequality for the first term on the right-hand side of the latter equation, we obtain the following: Applying Lemma 11 (c) to the second term yields ‖P_{U_ad}(u_n − τ_n Ĝ_n) − u_n‖^2 ≤ τ_n^2 ‖Ĝ_n‖^2.
Since the first term on the right-hand side of the estimate in the above lemma is strictly negative while the second term is strictly positive, we may expect a descent provided E[φ_n] is small enough.

Corollary 17 (Convergence result) We have the following:
Proof Since ‖Ĝ_n‖ is bounded and Σ_{n=1}^∞ τ_n^2 < ∞ by Assumption 6, the result follows by (3.2).
Before we present our main results, we need the following auxiliary result:

Lemma 18 (Projection of gradient steps)
Proof Define x := u_n, y := Ĝ_n, and τ := τ_n. We assume that x − τy ∉ U_ad (otherwise the result follows by Lemma 11) and set n_τ := x − τy − P_{U_ad}(x − τy) together with the associated supporting halfspace H. Since U_ad is convex, we have U_ad ⊂ H, and therefore, for all u ∈ R^{d_u}, by Lemma 11, where P_H is the orthogonal projection onto H (compare Fig. 3). With B := (t/τ)(P_{U_ad}(x − τy) − x) + x and (3.3), we have the following: Recalling the characterization of stationary points from Corollary 12, we obtain our first main result (Theorem 19). Proof First, note that by the compactness of U_ad and the regularity of F, F_inf := inf_{u ∈ U_ad} F(u) > −∞. Summing both sides of the inequality in Lemma 16 up to an N ∈ N gives the following: Hence, by Corollary 17, Using Lemma 11 (ii) together with Lemma 18 gives, for all n ∈ N sufficiently large, Since ‖u_{n+1} − u_n‖ ≤ τ_n ‖Ĝ_n‖, this combined with Corollary 15 gives the result.
As a direct consequence, we have the following: Theorem 20 (Main theorem) Let (u_n)_{n∈N} be generated by Algorithm 1. Then, there exists a subsequence (u_{n_k})_{k∈N} converging to a stationary point, i.e., Proof Direct consequence of Theorem 19.
For applications, the condition on the step length (Assumption 6) is inconvenient, since the step length becomes very small and the algorithm thus progresses only slowly. If the algorithm is performed with a constant stepsize and the sequence (u_n)_{n∈N} converges to some u* ∈ U_ad, then the following theorem shows that u* is a stationary point.
Theorem 21 (Convergent series) Assume the stepsize sequence (τ_n)_{n∈N} satisfies τ_n ≥ τ for all n ∈ N and some τ > 0. Let further (v_n)_{n∈N} be dense in V_ad and assume that (u_n)_{n∈N} converges to u* ∈ U_ad. Then, u* is a stationary point of F, i.e., Proof Similarly to Corollary 15, ‖Ĝ_n − ∇F_n‖ → 0. Thus, by convergence of (u_n)_{n∈N}, we obtain the claim by Lemma 18 and Lemma 11 (c).
Note that almost every sequence (v_n)_{n∈N} generated by the random number generator in Algorithm 1 is dense in V_ad. In addition to the convergence properties shown in the latter theorems, the algorithm also approximates the objective functional value with arbitrary accuracy: Corollary 22 (Approximation of F) Let the sequence (u_n)_{n∈N} be generated by Algorithm 1. Then, for all convergent subsequences (u_{n_k})_{k∈N} with u_{n_k} → u*, we obtain for F̂ as defined in Algorithm 1: |F̂_{n_k} − F(u*)| → 0 almost surely; assuming further that (v_{n_k})_{k∈N} is dense in V_ad, we obtain lim_{k→∞} F̂_{n_k} = F(u*).
Proof The proof is similar to the proof of Lemma 13, relying on the Lipschitz continuity of F (a direct consequence of Assumption 1) and Corollary 14.
Remark 23 (Termination condition) By the regularity assumption on the objective functional in Assumption 1, the termination condition in Algorithm 1 can be posed as follows, for ε > 0. This is obviously not possible for SG. To satisfy such a condition with the SAG method, the discretization of the objective functional has to be sufficiently fine in order to approximate the gradient sufficiently accurately. However, depending on the particular example, an a priori discretization satisfying this property can be hard to determine.
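For a box-shaped U_ad, the first-order condition of Corollary 12 can be monitored via the projected-gradient residual; the following is a minimal sketch, where the box bounds and the function name are our own assumptions:

```python
import numpy as np

def stationarity_residual(u, G_hat, lo=-0.5, hi=0.5):
    """||u - P_U(u - G_hat)|| for a box U_ad = [lo, hi]^d; the residual is zero
    iff u satisfies the first-order condition for the approximated gradient."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    G_hat = np.atleast_1d(np.asarray(G_hat, dtype=float))
    return float(np.linalg.norm(u - np.clip(u - G_hat, lo, hi)))

# Interior point with nonzero gradient: residual equals ||G_hat||.
# Boundary point with the gradient pushing outward: residual is zero.
```

Combined with the approximation F̂_n of the objective value, such a residual check yields a termination test in the spirit of Remark 23; it is meaningful for CSG precisely because Ĝ_n converges to the true gradient.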

Numerical results
In this section, we will compare the following stochastic optimization methods mentioned in the introductory section:

- CSG (continuous stochastic gradient method): as introduced in Section 2, this scheme relies on the computation of a single gradient in each iteration and the interpolation with previously computed information.
- SG (stochastic gradient method): the classical stochastic optimization scheme as outlined in Bottou et al. (2018). The convergence of the method is based on decreasing stepsizes.
- SAG (stochastic average gradient method): an improved stochastic gradient scheme as introduced in Schmidt et al. (2017), restricted to the case of a finite sum as objective rather than an integral. The true advantage of possibly larger stepsizes can be seen in the examples and also holds for CSG. We will write SAG_N (N ∈ N) for an SAG method relying on an N-step quadrature rule to discretize the integral in the original objective.
The performance of these methods strongly depends on the chosen stepsizes. In the following examples, the stepsizes are chosen such that all the schemes have a good performance, in the range of their possibilities. However, adaptive stepsize control (also known as learning rate in the field of machine learning) is itself a subject of research for stochastic optimization schemes and is not the focus of this contribution. For example, in Kingma and Ba (2015), the stepsizes are derived from estimates of first and second moments of the gradients and in Tan et al. (2016), a Barzilai-Borwein-type stepsize adaption is discussed.
To compare the methods, we display the results as follows. For a large number of optimization runs, we compute the quantile curves α_p(n) defined by P(u_n > α_p(n)) = p for p ∈ (0, 1). With this, we define the quantile sets P_{p,q} which lie between the p- and the q-quantile, i.e.,

P_{p,q} := {(n, u) ∈ N × U_ad : α_p(n) < u < α_q(n)}. (4.1)

These areas will be colored in various degrees of opacity in order to show the probabilistic behavior of the optimization procedure.
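The quantile curves of (4.1) are plain per-iteration empirical quantiles over many runs; a small sketch with stand-in data (the synthetic runs below are our own toy construction, not the paper's experiments):

```python
import numpy as np

def quantile_curves(runs, p, q):
    """Empirical quantile curves alpha_p(n), alpha_q(n) over many runs.

    runs -- array of shape (n_runs, n_iter) holding the iterates u_n per run;
    the band between the two curves is the quantile set P_{p,q} from (4.1).
    """
    return np.quantile(runs, p, axis=0), np.quantile(runs, q, axis=0)

# Stand-in data: 1000 noisy runs decaying towards the optimum 0.
rng = np.random.default_rng(3)
decay = 0.9 ** np.arange(50)
runs = decay[None, :] * (1.0 + 0.3 * rng.standard_normal((1000, 50)))
lo, hi = quantile_curves(runs, 0.1, 0.9)
```

Shading the region between `lo` and `hi` (and between further quantile pairs) reproduces the nested bands used in Figs. 5 and 6.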
First, we will compare the algorithms by optimizing the function defined in (2.1) (equivalently, (4.2)), as well as an additional academic objective functional defined in (4.3).

Academic examples
We will study the behavior of the algorithms in the following two cases: the objective (4.2), as introduced in (2.1) (see Fig. 1), and the objective (4.3) (see Fig. 4).

Fig. 5 Comparison of CSG (top row, green), SG (second row, blue), SAG_4 (third row, purple), and SAG_8 (bottom row, red) in the case of objective functional (4.2) with stepsize τ_n = 10^{−1} n^{−0.6} (left column) as well as constant stepsize τ_n = 10^{−4} (right column) for 256 optimization steps each. The shaded areas denote, from light to dark, the quantile sets P_{0.1,0.9}, P_{0.2,0.8}, P_{0.3,0.7}, and P_{0.4,0.6} as defined in (4.1). The dashed magenta line denotes the optimal value of F. It is noted that in the case of SAG_4 and SAG_8, convergence towards different local minima introduced by the discretization is observed.

To be able to apply the SAG algorithm, we approximate the integral by the trapezoidal rule in both cases. For this, we use an equidistant grid, i.e., for N ∈ N and v_0 := −1, we define v_k := v_0 + kh, k = 0, ..., N, with h := |V_ad|/N = 2/N. In this way, F is approximated by a finite sum. The optimization problem considered in the SAG case thus reads as in (4.4). The approximation error directly depends on the second derivative of f w.r.t. the second argument and on the grid spacing h, that is, on the number of intervals. A finer grid thus leads to a better approximation, but, for a deterministic gradient descent method, also to a high number of state problems to solve in each iteration. The comparison of the algorithms is based on the number of function evaluations, as these constitute the time-consuming steps for complex optimization tasks. We compare two different settings, one with a stepsize proportional to n^{−0.6} (see left column in Figs. 5 and 6) and one with an appropriately chosen constant stepsize (see right column in Figs. 5 and 6). The shaded areas in Figs. 5 and 6 denote the quantile sets as defined in (4.1) for the 10 and 90% quantiles (light), the 20 and 80% as well as the 30 and 70% quantiles (medium dark), and the 40 and 60% quantiles (dark). The quantiles are based on 10^5 optimization runs each with randomized initial datum.
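The equidistant grid and trapezoidal weights described above can be generated by a small helper reflecting this construction (the function name is ours):

```python
import numpy as np

def trapezoid_scenarios(N, a=-1.0, b=1.0):
    """Equidistant nodes v_0, ..., v_N on [a, b] and trapezoidal weights, i.e.,
    the finite-sum data that SAG_N uses in place of the integral objective."""
    v = np.linspace(a, b, N + 1)
    w = np.full(N + 1, (b - a) / N)
    w[0] *= 0.5
    w[-1] *= 0.5
    return v, w

v4, w4 = trapezoid_scenarios(4)    # the SAG_4 setting on V_ad = (-1, 1)
```

Each node v_k then corresponds to one component of the finite sum in (4.4), weighted by w_k.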
The black lines in Figs. 5 and 6 identify the median for CSG and SG. For SAG_4 and SAG_8, they identify the median of the sequences converging to one of the local minima. In addition, the line thickness and the patch opacity are proportional to the probability of converging to the respective local minimum.

Results of numerical experiments
In the case of objective functional (4.2) optimized by SAG_4 and SAG_8, the algorithm converges to one of the three or five local minima of the discretized objective functional, respectively (see Fig. 1 with N = 4 and N = 8). In contrast to SAG, SG (in the case of decreasing stepsizes, see Fig. 5, left) and CSG (in both considered cases, see Figs. 5 and 6) converge to the optimal value of F. The "failure" of SAG is due to the fact that SAG only approximates the original objective functional. On the other hand, it is clear that for a sufficiently regular function (see Assumption 1), the optimal values of SAG_N converge to the optimal solution of the original problem for N → ∞. However, a sufficiently large number N is in general not known a priori. Thus, even for a large N, it is not clear how (local) optimal solutions u* as well as their objective functional values F(u*) are affected by the discretization of the integral. Moreover, the convergence of the SAG method becomes slower with larger N, as can be clearly seen from the more diffuse quantile sets in Figs. 5 and 6. Figures 5 and 6 clearly show the advantage of CSG: it converges considerably faster than SG and does not converge to an artificial local minimum as SAG does. While SG also approaches the optimal value, at least in the case with decreasing stepsize, its speed of convergence is considerably lower compared to CSG. The true potential of CSG comes into play whenever constant stepsizes τ_n are chosen; this can be seen in the right columns of Figs. 5 and 6. It should be noted that the stepsize could be adapted individually for each method in order to achieve a slightly better convergence behavior. In particular, the constant stepsize we have chosen uniformly for all methods seems to be too large for SG. On the other hand, CSG can easily handle this stepsize.

Fig. 6 Comparison of CSG (top row, green), SG (second row, blue), SAG_4 (third row, purple), and SAG_8 (bottom row, red) in the case of objective functional (4.3), stepsize τ_n = n^{−0.6} (left column) as well as constant stepsize τ_n = 10^{−2} (right column) for 64 optimization steps each. Note that the stationary point for SAG differs from SG and CSG due to the discretization of the objective functional. The shaded areas indicate, from light to dark, the quantile sets P_{0.1,0.9}, P_{0.2,0.8}, P_{0.3,0.7}, and P_{0.4,0.6} as defined in (4.1). The dashed magenta line denotes the optimal value of F. The artificial-looking jumps in the area plots are due to the non-symmetry of the objective functional.
Finally, it is observed that CSG combines the advantages of SAG and SG. For instance, in the left column in Fig. 6, SAG 4 converges quickly but unfortunately to the "wrong" result, while SG converges to the correct limit point, but convergence is very slow. In contrast, CSG converges quickly to the correct limit point.
Another advantage of CSG is that, according to Theorem 20 and Corollary 21 and as discussed in Remark 22, the CSG algorithm can be stopped as soon as the objective function is approximated with a prescribed accuracy and the first-order optimality conditions are satisfied within a given error tolerance. This is, in general, possible neither for SG nor for SAG.
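Such a stopping test can be sketched as follows for a box-constrained problem. This is a hypothetical minimal sketch, not the paper's implementation: the names (F_hat_history for the history of approximated objective values, g_hat for the gradient approximation, and the box bounds) are placeholders; the precise criteria are those of Theorem 20, Corollary 21, and Remark 22.

```python
import numpy as np

def should_stop(F_hat_history, u, g_hat, lower, upper,
                tol_f=1e-4, tol_g=1e-3):
    """Hypothetical first-order stopping test for a box-constrained problem.

    Stops once the approximated objective values have settled (their change
    is below tol_f) and the projected-gradient stationarity measure
    ||u - P_[lower,upper](u - g_hat)|| is below tol_g.
    """
    if len(F_hat_history) < 2:
        return False
    settled = abs(F_hat_history[-1] - F_hat_history[-2]) < tol_f
    proj = np.clip(u - g_hat, lower, upper)   # projection onto the box
    stationary = np.linalg.norm(u - proj) < tol_g
    return bool(settled and stationary)

# At a stationary interior point with a settled objective, the test fires:
print(should_stop([1.0, 1.0 + 1e-6], np.array([0.5]), np.array([0.0]), 0.0, 1.0))  # prints True
```

The key point is that both quantities entering the test are by-products of the CSG iteration itself, which is why this check is available for CSG but not, in general, for SG or SAG.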

Structural optimization example
As a design optimization test case, we have chosen the optimization of a 2D tire, fixed at the midpoint and loaded from an arbitrary direction. A weighted sum of the expected compliance, a volume penalization term, and a regularization term forms the objective functional.
In detail, we consider the design domain Ω := {x ∈ R^2 | 0.1 < ‖x‖ < 1}, an annulus, which is subject to a load described by the function g defined in (4.5) on the outer boundary Γ_N ⊂ ∂Ω and fixed with a homogeneous Dirichlet condition on the inner boundary Γ_D ⊂ ∂Ω (see Fig. 7). Here, n(x) denotes the outer normal vector at x ∈ ∂Ω, and h_α : R → R denotes, for α ∈ R, the scalar function

h_α(β) = 1 + tanh(10^3 (cos(β − α) − 1)) + 10^{−1}. (4.6)

The angle α describes the position where the boundary force attains its maximum value, and the function can be seen as a smoothed Dirac force on the boundary (see Fig. 8).
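The behavior of (4.6) can be checked numerically. The following small sketch reads the formula as h_α(β) = 1 + tanh(10^3(cos(β − α) − 1)) + 10^{−1} and confirms the smoothed-Dirac shape shown in Fig. 8:

```python
import numpy as np

def h_alpha(beta, alpha):
    """Smoothed Dirac-like load magnitude of Eq. (4.6).

    At beta = alpha the tanh term vanishes, giving the peak value
    1 + 0 + 0.1 = 1.1; away from alpha the tanh term tends to -1,
    leaving only the small offset 0.1.
    """
    return 1.0 + np.tanh(1e3 * (np.cos(beta - alpha) - 1.0)) + 1e-1

alpha = np.pi / 2                       # peak position as in Fig. 8
beta = np.linspace(0.0, 2.0 * np.pi, 1000)
vals = h_alpha(beta, alpha)
print(h_alpha(alpha, alpha))            # peak value: 1.1
print(round(float(vals.min()), 3))      # far-field offset: 0.1
```

The large factor 10^3 inside the tanh makes the transition between peak and offset very sharp, which is what justifies the "smoothed Dirac" interpretation.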
In Ω, material properties are defined using a pseudo-density function ρ, which scales a given isotropic material characterized by the Lamé parameters λ and μ. Denoting this material by E, the resulting material function is given by the SIMP law ρ^p E with a penalty parameter p > 1; see Bendsøe (1989). We assume that the material properties are fixed close to the outer boundary and thus set the pseudo-density ρ to 1 in this part of the domain, i.e., ρ|_Ω̃ ≡ 1 with Ω̃ := {x ∈ R^2 | 0.9 < ‖x‖ < 1} (see Fig. 7, yellow marked area). In the rest of the domain, ρ serves as the design variable and is allowed to vary between a small positive value ε and 1. Now, for each admissible design ρ and each α ∈ [0, 2π], a linear elasticity problem, the so-called state problem, is defined on Ω, applying the boundary conditions described above. The corresponding state solution is denoted by u(ρ, α).

Fig. 7: Design setting for optimization problem (4.8). The normal force defined in (4.5) acts on the Neumann boundary Γ_N; Γ_D is a homogeneous Dirichlet boundary, and Ω is subdivided into the green region, which is subject to optimization, and a yellow region Ω̃ with ρ = 1.
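As a small illustration of the SIMP interpolation, the following hedged sketch uses the parameter values quoted later in the text (p = 3, ε = 10^{−2}); the factor ρ^p is what scales the isotropic material E:

```python
import numpy as np

p, eps = 3, 1e-2   # SIMP penalty and lower density bound from the example

def simp_scaling(rho):
    """SIMP law: the material tensor E is scaled by rho**p (Bendsoe 1989)."""
    rho = np.clip(rho, eps, 1.0)   # design variable is constrained to [eps, 1]
    return rho ** p

print(simp_scaling(1.0))   # solid material: full stiffness
print(simp_scaling(eps))   # near-void: eps**p = 1e-6, the void stiffness used in the text
```

The penalization p > 1 makes intermediate densities mechanically inefficient, pushing the optimizer toward 0/1 designs.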
The optimization goal is to minimize the expected compliance over the angles α ∈ [0, 2π]. In addition, we introduce a term penalizing the total material consumption and a filter regularization term as proposed in Semmler et al. (2018, Section 3.2). This leads to the objective functional (4.7), in which γ_0, γ_v, γ_ϕ > 0 are given scaling parameters, "∗" denotes the convolution operator in R^2, and the filter kernel ϕ : R^2 → R is defined by

ϕ(x) := max{0, r_0 − ‖x‖}

with radius r_0 > 0.

Fig. 8: Plot of h_α as defined in (4.6) for α = π/2, showing the magnitude of the normal force g defined in (4.5) acting on Γ_N of the domain visualized in Fig. 7.

By finite element discretization, this leads to the optimization problem (4.8). Here, M is the number of design variables, K(ρ_h) ∈ R^{N×N} denotes the global stiffness matrix with N degrees of freedom, and g_h(α) ∈ R^N is the right-hand side of the linear elastic state problem with the load centered at angle α in finite element notation. Moreover, j_h is the discretized equivalent of J. In the following example, the parameters are chosen as γ_0 = 1, γ_v = 0.1, γ_ϕ = 1, r_0 = 0.05, and SIMP parameter p = 3 (see, e.g., Bendsøe (1989)). As material parameters, we have chosen λ = μ = 1, and, with ε = 10^{−2}, the void stiffness was defined as 10^{−6}.

Fig. 9: Comparison of CSG (green), SG (blue), SAG 8 (purple), and SAG 16 (red) in the case of objective functional (4.7) with stepsize τ_n = 5 · 10^3 n^{−0.6} (top) and constant stepsize τ_n = 750 (bottom), with 2048 optimization steps each. Note that the stationary point for SAG differs from that of SG and CSG due to the discretization of the objective functional. The shaded areas denote the quantile sets P_{0.1,0.9} (light) and P_{0.25,0.75} (dark) as defined in (4.1). This plot clearly shows the advantage of CSG in terms of speed of convergence as well as resulting objective functional value.
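The smoothing effect of the hat-shaped kernel ϕ can be illustrated on a one-dimensional density profile. This is a hypothetical minimal sketch (grid, spacing, and discrete normalization chosen purely for illustration), not the paper's implementation of the filter term:

```python
import numpy as np

r0 = 0.05  # filter radius from the example setup

def phi(x):
    """Filter kernel phi(x) = max(0, r0 - ||x||)."""
    return max(0.0, r0 - float(np.linalg.norm(x)))

# Minimal 1D illustration of the convolution "rho * phi": a sharp 0/1
# density jump is smeared over a band of width ~2*r0.
h = 0.01                                        # hypothetical grid spacing
grid = np.arange(-0.2, 0.2, h)
rho = (grid > 0).astype(float)                  # sharp density jump
offsets = np.arange(-r0, r0 + h / 2, h)         # kernel support stencil
kernel = np.array([phi([x]) for x in offsets])
kernel /= kernel.sum()                          # normalized discrete weights
rho_filtered = np.convolve(rho, kernel, mode="same")
print(rho_filtered.min(), rho_filtered.max())   # stays approximately in [0, 1]
```

Because the discrete weights are non-negative and normalized, the filtered field remains a convex combination of neighboring density values, which is the regularizing effect the filter term rewards.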
Computational optimization results As in the academic examples presented in Section 4.1, we show and compare results for SG, SAG, and CSG. All optimization runs were started with ρ ≡ 1/2 in the design domain Ω \ Ω̃. The linear elasticity problem is discretized using ≈ 40 · 10^3 unstructured triangular elements generated by TRIANGLE (see Shewchuk (1996)). Using first-order Lagrange basis functions, this discretization results in approximately the same number of degrees of freedom for u_h. The design domain comprises roughly 20 · 10^3 degrees of freedom for ρ_h. The implementation is done in MATLAB, and the linear systems are solved using the direct solver available through the backslash operator.
Analogous to the previous experiments, we have chosen a suitable initial stepsize and have discretized the integral in (4.8) for SAG using a trapezoidal rule. Again, the SG method is applied to the undiscretized version in order to avoid dependencies on the choice and accuracy of the quadrature rule. For validation and comparison purposes, the objective function is additionally approximated with the trapezoidal rule using a total of 180 equidistant discretization points, which ensures a good approximation of the expected compliance.
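To give a feel for the quadrature involved, the following sketch applies the trapezoidal rule to a hypothetical smooth, 2π-periodic stand-in for the compliance as a function of the load angle (the true compliance would require a state solve per angle) and compares coarse rules against a 180-point reference, as used for validation here:

```python
import numpy as np

def c(alpha):
    """Hypothetical smooth 2*pi-periodic stand-in for the compliance."""
    return np.exp(np.cos(alpha))

def expected_value_trapezoid(f, n):
    """Trapezoidal approximation of (1/(2*pi)) * integral of f over [0, 2*pi]."""
    a = np.linspace(0.0, 2.0 * np.pi, n + 1)   # n subintervals, both endpoints
    y = f(a)
    integral = np.sum((y[:-1] + y[1:]) * np.diff(a) / 2.0)
    return integral / (2.0 * np.pi)

ref = expected_value_trapezoid(c, 180)          # fine reference rule
for n in (4, 8, 16):
    # error of the coarse SAG-style discretization relative to the reference
    print(n, abs(expected_value_trapezoid(c, n) - ref))
```

For smooth periodic integrands the trapezoidal rule converges rapidly in n; the point made in the text is that for the actual optimization problem a sufficiently large n is not known a priori, and a coarse choice visibly distorts the discretized objective.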
In Fig. 9, the distribution of obtained objective functional values for 480 optimization runs with 2048 steps each is compared for the different methods with appropriately chosen stepsize rules. The presented results clearly show the fast convergence of CSG in comparison to the other schemes.
Moreover, in Fig. 10, a rapid convergence of the increasingly better approximated objective function values is observed when applying the CSG method to the structural optimization problem. It is noted that, by construction, this type of convergence can neither be expected for the SG nor for the SAG method.
In Fig. 11, the so-called physical density ρ^p is shown for SAG, SG, and CSG for constant stepsize. While SG does not converge within 2048 iterations, SAG converges, though the resulting density distribution is strongly influenced by the discretization of the objective functional: note the eight struts corresponding to the eight discrete load cases applied (see the middle column of Fig. 11). No such effect is observed for the CSG result (right column of Fig. 11). Moreover, the CSG result appears much cleaner than the SG result, which is due to faster convergence.

Fig. 10: Relative error between the objective functional approximated by CSG and the objective functional approximated by 180 load cases, in the case of the functional stated in (4.7) and constant stepsize. It shows a clear convergence trend of the approximated objective functional value F̂ (see Algorithm 1) to the true objective functional value F, as predicted by Corollary 21.

Fig. 11: Comparison of the solution ρ^p for SG (left column), SAG 8 (middle column), and CSG (right column). The rows show, from top to bottom, the solution of one optimization run after 128, 256, 512, 1024, and 2048 optimization steps in the case of objective functional (4.7) and the optimization problem defined in (4.8) with suitable constant stepsize. In the middle column, one can clearly see the eight load cases applied in the discretized objective functional of SAG 8. For each optimization method, the run with the lowest resulting objective functional was chosen.
It is finally noted that, in the interest of comparability, we have not applied any continuation scheme (e.g., SIMP parameter p → ∞), which would result in a true "black and white" solution, i.e., ρ(x) ∈ {0, 1} for all x ∈ Ω \ Ω̃. Such a scheme is, of course, necessary to obtain a physically interpretable and manufacturable solution. We leave the definition of a continuation scheme suitable for CSG as a subject for further research.

Conclusion and outlook
In this work, we introduced the continuous stochastic gradient (CSG) method, which is applicable to a broad class of structural optimization problems. Preliminary experiments with challenging academic examples as well as an application from mechanics, in which an elastic structure was optimized with respect to infinitely many load cases, revealed that the CSG method outperformed both the classic SG method and its relative SAG, in the sense that a significantly lower function value was obtained within a given number of iterations. This is particularly interesting as the CSG algorithm requires roughly the same computational effort per optimization iteration as the SG and SAG methods. Moreover, like the SAG method, it benefits from gradient information obtained in previous iterations. Importantly, the CSG method does not require an a priori discretization of integrals entering the objective function, and the function value and gradient can be approximated with arbitrary precision in the course of the optimization iterations. As a result, the CSG method approaches a full gradient method over the course of the iterations.
While the CSG method appears promising, more examples, e.g., from robust optimization, acoustics, or optics, should be tested in the future to obtain a full picture of its behavior in practical applications.
Furthermore, from a theoretical point of view, an analysis of the convergence rate for convex functions would provide a deeper understanding of the algorithm. It should also be noted that the computational effort of computing the gradient weights α_0^n, ..., α_n^n grows with each iteration. To compensate for this, an efficient implementation of the algorithm is crucial. While this is easily achieved in the case of a one-dimensional index set V_ad, the question remains how it can be achieved in higher dimensions. Other interesting questions center around how Lipschitz constants can be estimated during the optimization process and how the stepsize can be adapted automatically. Finally, similarly to what is suggested in De et al. (2020), a combination with established structural optimization algorithms such as GCMMA is conceivable.
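To illustrate why the one-dimensional case is easy, the following is a hedged sketch of nearest-neighbor (Voronoi-type) integration weights for samples v_0, ..., v_n in a one-dimensional index set V_ad = [a, b]; the paper's exact weight construction may differ:

```python
import numpy as np

def nn_weights(samples, a=0.0, b=1.0):
    """Nearest-neighbor weights in 1D: each sample is weighted by the
    relative length of the subinterval of [a, b] closest to it."""
    s = np.asarray(samples, dtype=float)
    order = np.argsort(s)
    srt = s[order]
    mid = 0.5 * (srt[:-1] + srt[1:])            # midpoints between neighbors
    edges = np.concatenate(([a], mid, [b]))
    cells = np.diff(edges)                      # nearest-neighbor cell lengths
    weights = np.empty_like(cells)
    weights[order] = cells                      # back to original sample order
    return weights / (b - a)

# Sorting makes this an O(n log n) operation per iteration in 1D; in higher
# dimensions the analogous cells are Voronoi regions, which are far more
# expensive to compute and update.
print(nn_weights([0.9, 0.1, 0.5]))              # weights sum to 1
```

This makes the scaling issue raised above concrete: the weight computation touches all previous samples, so its cost necessarily grows with the iteration count.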