The Continuous Stochastic Gradient Method: Part I -- Convergence Theory

In this contribution, we present a full overview of the continuous stochastic gradient (CSG) method, including convergence results, step size rules and algorithmic insights. We consider optimization problems in which the objective function requires some form of integration, e.g., expected values. Since approximating the integration by a fixed quadrature rule can introduce artificial local solutions into the problem while simultaneously raising the computational effort, stochastic optimization schemes have become increasingly popular in such contexts. However, known stochastic gradient type methods are typically limited to expected risk functions and inherently require many iterations. The latter is particularly problematic if the evaluation of the cost function involves solving multiple state equations, given, e.g., in the form of partial differential equations. To overcome these drawbacks, a recent article introduced the CSG method, which reuses old gradient sample information via the calculation of design-dependent integration weights to obtain a better approximation to the full gradient. While in the original CSG paper convergence of a subsequence was established for a diminishing step size, here we provide a complete convergence analysis of CSG for constant step sizes and an Armijo-type line search. Moreover, new methods to obtain the integration weights are presented, extending the application range of CSG to problems involving higher-dimensional integrals and distributed data.


Introduction
In this contribution, we present a full overview of the continuous stochastic gradient (CSG) method, including convergence results, step size rules, algorithmic insights and applications from topology optimization. The prototype idea for CSG was first proposed in [1]. Therein, it was shown that for expected-valued objective functions, CSG with diminishing step sizes and exact integration weights (see Section 3) almost surely produces a subsequence converging to a stationary point [1, Theorem 20]. This work significantly generalizes that result, providing proofs for convergence of the full sequence of iterates in the case of constant step sizes and backtracking line search techniques. Additionally, the convergence results hold in a less restrictive setting and for generalized approaches to the integration weight calculation (see Section 3). Before going into details, we want to explain the reason for introducing what seems like yet another first-order stochastic optimization scheme.

Motivation from PDE-constrained Optimization
Within PDE-constrained optimization, settings with expected-valued objective functions arise in numerous applications, ranging from purely stochastic settings, like machine learning or noisy simulations [2,3], to fully deterministic problems, in which one is interested in a design that is optimal for an infinite set of outer parameters, e.g., [4-7]. Especially in large-scale settings, one usually does not consider deterministic approaches (see, e.g., [8]) for the solution of such problems, as they are generally too computationally expensive or even intractable. Instead, one uses stochastic optimization schemes, like the Stochastic Gradient (SG) [9] or Stochastic Average Gradient (SAG) method [10]. A large number of schemes have been derived from these and thoroughly analyzed, including sequential quadratic programming for stochastic optimization [11,12], quasi-Newton stochastic gradient schemes [13-16] and the well known adaptive gradient schemes Adam & Adagrad [17,18], to name some prominent examples.
Problematically, such methods rely on a heavily restrictive setting, in which the objective function value of a design u is simply given as the expected value of some quantity j, i.e., J(u) = E_x[j(u, x)]. Even the basic setting of nesting two expected values, i.e., J(u) = E_y[j_2(y, E_x[j_1(u, x)])], is beyond the scope of the mentioned schemes and requires special techniques, e.g., the Stochastic Compositional Gradient Descent (SCGD) method [19], which itself is again only applicable in this specific setting.
In this contribution, we investigate an application from the field of optimal nanoparticle design, which admits exactly such a complex structure. Our main interest lies in the optical properties of nanoparticles. Specifically, the color of particulate products, which has been of great interest in many fields of research [20-25], is the quantity we aim to optimize. This and similar applications serve only as motivation in this paper. However, the numerical analysis of CSG [26] demonstrates that CSG indeed represents an efficient approach to such problems. While a detailed introduction to this setting is given in [26], we want to briefly summarize the problems arising in this application.
To obtain the color of a particulate product, we need to calculate the relevant optical properties of the nanoparticles in the product. For each particle design, this requires solving the time-harmonic Maxwell's equations, which, depending on the setting, is numerically expensive. Furthermore, the color of the whole product is not determined by the optical properties of a single particle. Instead, we need to average these properties over, e.g., the particle design distribution and their orientation in the particulate product. Afterwards, the averaged values are used to calculate intermediate results for the specific setting. These results then need to be integrated over the range of wavelengths visible to the human eye to obtain the color of the particulate product. Finally, the objective function uses the resulting color in a nonlinear fashion before yielding the actual objective function value. In general, the objective function has the form J(u) = j_3(u, E_y[j_2(u, y, E_x[j_1(u, x, y)])]) and can easily involve even more convoluted terms in more advanced settings.
On the one hand, the computational cost of deterministic approaches to such problems ranges from tremendous to infeasible, since j_1 is typically not easy to evaluate. On the other hand, standard schemes from stochastic optimization, like the ones mentioned above, simply cannot solve the full problem. Thus, we are in need of a general method to tackle optimization problems given by arbitrary concatenations of nonlinear functions and expected values.
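As a toy illustration of why such nested structures break the standard stochastic gradient framework, consider the following sketch. The functions j_1, j_2 and the uniform distribution are illustrative choices, not taken from the nanoparticle application:

```python
import math
import random

# Toy composite objective J(u) = j2(E_x[j1(u, x)]) with x ~ U(0, 1).
# The inner expectation enters the nonlinear outer function j2, so a
# single sample of x does not yield an unbiased estimate of J or its
# gradient -- the expectation must be approximated before applying j2.
def j1(u, x):
    return math.sin(u) + x

def j2(y):
    return y ** 2

def monte_carlo_objective(u, n_samples, rng):
    # Estimate the inner expectation first, then apply j2 once.
    inner = sum(j1(u, rng.random()) for _ in range(n_samples)) / n_samples
    return j2(inner)

rng = random.Random(0)
estimate = monte_carlo_objective(0.5, 10_000, rng)
# Exact value: (sin(0.5) + 1/2)^2, since E[x] = 1/2 for x ~ U(0, 1).
```

Plugging single-sample estimates of the inner expectation directly into j_2 would yield a systematically biased gradient, which is the core difficulty sketched above.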

Properties of CSG
As mentioned in the previous section, the CSG method aims to offer a new approach to optimization problems that involve integration of properties which are expensive to evaluate. In each iteration, CSG draws a very small number (typically 1) of random samples, as is the case for SG. However, instead of discarding the information collected through these evaluations after the iteration, these results are stored. In later iterations, all of the information collected along the way is used to build an approximation to the full objective function and its gradient by a special linear combination. The weights appearing in this linear combination, which we call integration weights, can be calculated in several different fashions, which are detailed in Section 3.
As a key result for CSG, we are able to show that the approximation error in both the gradient and the objective function approximation vanishes during the optimization process (Lemma 4.6). Thus, in the course of the iterations, CSG behaves more and more like an exact full gradient algorithm and is able to solve optimization problems far beyond the scope of standard SG-type methods. Furthermore, we show that this special behaviour results in CSG having convergence properties similar to full gradient descent methods, while keeping the advantages of stochastic optimization approaches, i.e., each step is computationally cheap. To be precise, we prove convergence to a stationary point for constant step sizes (Theorem 5.1) and an Armijo-type line search (Theorem 6.1), which is based on the gradient and objective function approximations of CSG.
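The reuse of old samples can be illustrated with a minimal toy loop. The quadratic integrand, the uniform distribution and the empirical nearest-neighbor weighting below are illustrative choices in the spirit of the weight rules of Section 3, not the exact implementation of Algorithm 1:

```python
import random

# Toy CSG-style loop for J(u) = E_x[(u - x)^2 / 2], x ~ U(0, 1),
# which has the minimizer u* = 1/2.
def grad_j(u, x):
    return u - x  # gradient sample of the integrand

def csg_gradient(u_n, samples, grads):
    # Empirical nearest-neighbor weights: each stored pair (u_k, x_k) is
    # weighted by the fraction of drawn x-values for which (u_n, x) is
    # closest to (u_k, x_k); samples at far-away designs get weight 0.
    n = len(samples)
    counts = [0] * n
    for (_, x) in samples:
        k = min(range(n),
                key=lambda k: abs(u_n - samples[k][0]) + abs(x - samples[k][1]))
        counts[k] += 1
    return sum((c / n) * g for c, g in zip(counts, grads))

rng = random.Random(1)
u, tau = 2.0, 0.5
samples, grads = [], []
for _ in range(150):
    x = rng.random()                 # one new sample per iteration, as in SG
    samples.append((u, x))
    grads.append(grad_j(u, x))
    u -= tau * csg_gradient(u, samples, grads)  # reuse all stored samples
```

Early on, only the newest sample carries weight and the iteration behaves like SG; as the iterates settle, many stored samples near the current design share the weight and the gradient approximation improves.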

Limitations of the Method
While CSG combines advantages of deterministic and stochastic optimization schemes, the hybrid approach also yields drawbacks, which we try to address throughout this contribution. As mentioned earlier, the intended application for CSG lies in expected-valued optimization problems in which solving the state problem is computationally expensive. In many other settings that heavily rely on stochastic optimization methods, e.g., neural networks, the situation is different. Here, we can efficiently obtain a stochastic (sub-)gradient, meaning that we are better off simply performing millions of SG iterations than a few thousand CSG steps. In these situations, the inherent additional computational effort that lies within the calculation of the CSG integration weights (see Section 3) is no longer negligible and usually cannot be compensated by the improved gradient approximation.
Furthermore, the convergence rate of CSG worsens as the dimension of integration increases. While this can be avoided if the objective function imposes additional structure, it remains a drawback in general. A detailed analysis of this issue can be found in [26].
We emphasize that CSG and SG-type methods share many similarities.However, their intended applications are complementary to each other, with SG preferring objective functions of simple structure and computationally cheap sampling, while CSG prefers the opposite.

Structure of the Paper
Section 2 states the general framework of this contribution as well as the basic assumptions we impose throughout the paper. Furthermore, the general CSG method is presented.
The integration weights, which play an important role in the CSG scheme, are detailed in Section 3. Therein, we introduce four different methods to obtain weights which satisfy the necessary assumptions made in Section 2 and analyze some of their properties. The section also describes techniques to implement mini-batches in the CSG method.
Auxiliary results concerning CSG are given in Section 4. This includes the gradient approximation property of CSG (Lemma 4.6) and a numerical example for the generalized setting of the optimization problem (Section 4.2). The first part of the convergence theory, i.e., convergence for constant step sizes (Theorem 5.1 and Theorem 5.2), is presented in Section 5 and tested on a simple example (Section 5.1).
Afterwards, in Section 6, we incorporate an Armijo-type line search in the CSG method and provide a convergence analysis for the resulting optimization scheme (Theorem 6.1). The theoretical results are additionally tested on an academic example (Section 6.2).
A numerical analysis of CSG, concerning the performance for non-academic examples and convergence rates, is not part of this contribution, as this can be found in [26].

Setting and Assumptions
In this section, we introduce the general setting and formulate the assumptions made throughout this contribution. Additionally, the basic CSG algorithm is presented and some preliminary results are stated.

Setting
As mentioned above, the convergence analysis of the CSG method is carried out in a simplified setting in order to shorten notation and improve the overall readability. For the general case, see Remark 4.1.
Definition 2.1 (Objective Function) For d_o, d_r ∈ N, we introduce the set of admissible optimization variables U ⊂ R^{d_o} and the parameter set X ⊂ R^{d_r}. The objective function J : U → R is then given by

J(u) = ∫_X j(u, x) µ(dx),

where we assume j ∈ C^1(U × X; R) to be measurable and X ∼ µ.
The (simplified) optimization problem is then given by

min_{u ∈ U} ∫_X j(u, x) µ(dx).   (1)
Since U and X are finite dimensional, we do not have to consider specific norms on these spaces due to the equivalence of norms and can instead choose them in a problem-specific fashion. In the following, we will denote them simply by ‖·‖_U and ‖·‖_X.
During the optimization, we need to draw independent random samples of the random variable X, as stated in the following assumption.

Assumption 2.1 (Sample Sequence) The samples x_1, x_2, ... are drawn independently and identically distributed according to µ.

Remark 2.1 (Density of the Sample Sequence) It is important to note that a sample sequence satisfying Assumption 2.1 is dense in supp(µ) with probability 1. This can be seen as follows: let x ∈ supp(µ) and ε > 0. Then, given an independent identically distributed sequence x_1, x_2, ... ∼ µ, we have µ(B_ε(x)) > 0 and thus ∑_{n∈N} P(x_n ∈ B_ε(x)) = ∞. Hence, by the Borel-Cantelli Lemma [27, Theorem 2.7], P(x_n ∉ B_ε(x) for all n ∈ N) = 0. Thus, the sequence (x_n)_{n∈N} is dense in supp(µ) with probability 1.
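The density argument above can be illustrated numerically. The sketch below, with µ the uniform distribution on [0, 1] (an illustrative choice), checks that a fixed ball is hit and that the covering radius of the sample set becomes small:

```python
import random

# Illustration (not a proof) of the density argument: since
# mu(B_eps(x)) > 0 for x in supp(mu), the probability that the first N
# i.i.d. samples all miss B_eps(x) is (1 - mu(B_eps(x)))**N -> 0.
rng = random.Random(42)
x0, eps, N = 0.3, 0.05, 2000                # target point, radius, sample count
samples = [rng.random() for _ in range(N)]  # mu = U(0, 1)
hit = any(abs(s - x0) < eps for s in samples)

# Covering radius: the largest distance from a grid over supp(mu) to the
# nearest sample; it shrinks as more samples are drawn.
grid = [i / 1000 for i in range(1001)]
cover = max(min(abs(g - s) for s in samples) for g in grid)
```

With 2000 uniform samples, the miss probability for the chosen ball is (0.9)^2000, i.e., numerically zero, and the covering radius is far below the radius ε.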
Remark 2.2 (Almost sure convergent results) The (almost sure) density of the sample sequence plays a crucial role in the upcoming proofs.Hence, the convergence results presented in this contribution all hold in the almost sure sense.However, to improve the readability, we refrain from always mentioning this explicitly.
Moreover, the sets U and X need to satisfy the following regularity conditions:

Assumption 2.2 (Regularity of U, X and µ) The set U ⊂ R^{d_o} is compact and convex. The set X ⊂ R^{d_r} is open and bounded with supp(µ) ⊂ X.
Notice that the second part of Assumption 2.2 can always be achieved, as long as supp(µ) ⊂ R dr is bounded.
Finally, as in the deterministic case, we assume the gradient of the objective function to be Lipschitz continuous, in order to obtain convergence for constant step sizes.
Assumption 2.3 (Regularity of j) The function ∇_1 j : U × X → R^{d_o} is bounded and Lipschitz continuous, i.e., there exist constants C, L_j ∈ R_{>0} such that

‖∇_1 j(u, x)‖_U ≤ C and ‖∇_1 j(u_1, x_1) − ∇_1 j(u_2, x_2)‖_U ≤ L_j (‖u_1 − u_2‖_U + ‖x_1 − x_2‖_X)

for all u, u_1, u_2 ∈ U and x, x_1, x_2 ∈ X.

Remark 2.3 (Regularity of ∇J) By Assumptions 2.1 to 2.3, J : U → R is L-smooth, i.e., there exists L > 0 such that

‖∇J(u_1) − ∇J(u_2)‖_U ≤ L ‖u_1 − u_2‖_U for all u_1, u_2 ∈ U.

The CSG Method
Given a starting point u_1 ∈ U and a random parameter sequence x_1, x_2, ... according to Assumption 2.1, the basic CSG method is stated in Algorithm 1. In each iteration, the inner objective function j and gradient ∇_1 j are evaluated only at the current design-parameter pair (u_n, x_n). Afterwards, the integrals appearing in J and ∇J are approximated by a linear combination consisting of all samples accumulated in previous iterations.
The coefficients (α_k)_{k=1,...,n} appearing in Step 4 of Algorithm 1, which we call integration weights, are what differentiates CSG from other stochastic optimization schemes. In [1], where the main idea of the CSG method was proposed for the first time, a special choice of how to calculate these weights was already presented. A recap of this procedure as well as several new methods to obtain the integration weights are given in detail in Section 3.
Furthermore, P_U appearing in Step 8 of Algorithm 1 denotes the orthogonal projection onto U (in the sense of ‖·‖_U), i.e.,

P_U(w) = argmin_{v ∈ U} ‖v − w‖_U.

The final general assumption we will use throughout this contribution is related to the integration weights mentioned above.
Assumption 2.4 (Integration Weights) Denote the sequence of designs and random parameters generated by Algorithm 1 until iteration n ∈ N by (u_k)_{k=1,...,n} and (x_k)_{k=1,...,n}, respectively. Define k_n : X → {1, ..., n} as

k_n(x) ∈ argmin_{k=1,...,n} ‖(u_n, x) − (u_k, x_k)‖_{U×X}.   (2)

Then, for all n ∈ N, there exists a probability measure µ_n on X with µ_n ⇒ µ such that

α_k^{(n)} = ∫_X δ_{k_n(x)}(k) µ_n(dx).   (3)

Here, (α_k^{(n)})_{k=1,...,n} denotes the integration weights in iteration n ∈ N, while δ_{k_n(x)}(k) corresponds to the Dirac measure of k_n(x).

There is a very simple idea hidden behind the technicalities of Assumption 2.4. Condition (3) states that the integration weights should be based on a nearest neighbor approximation, while the condition µ_n ⇒ µ ensures that the weight of each sample is reasonably chosen. Due to the finite-dimensional setting, all convergence results proven in this contribution hold independently of the norm on U × X chosen in (2). However, the specific choice may strongly influence the behavior (see Section 3.5) and performance of CSG. Further insight on the integration weights and multiple methods to obtain weights satisfying Assumption 2.4 are given in the following Section 3.

Algorithm 1 General CSG Method

1: while termination condition not met do
2:   Draw a random sample x_n and evaluate the gradient sample ∇_1 j(u_n, x_n)
3:   Sample objective function (optional): j(u_n, x_n)
4:   Calculate integration weights (α_k)_{k=1,...,n}
5:   Gradient approximation: Ĝ_n = ∑_{k=1}^n α_k ∇_1 j(u_k, x_k)
6:   Objective function approximation (optional): Ĵ_n = ∑_{k=1}^n α_k j(u_k, x_k)
7:   Choose a step size τ_n
8:   Update: u_{n+1} = P_U(u_n − τ_n Ĝ_n)
9: end while

The general CSG method as proposed in [1] for the simplified setting (1). Further information on how to carry out the integration weight calculation in practice is given in Section 3.
Fig. 1 The grey dots represent our nine sample points (u_i, x_i) ∈ U × X. The full gradient of the objective function is obtained by integrating ∇_1 j(u_9, ·) along the blue line.

Integration Weights
We start with a brief motivation: suppose that we are currently in the 9th iteration of the CSG algorithm. So far, we have sampled ∇_1 j(u, x) at nine points (u_i, x_i)_{i=1,...,9}, and the full gradient at the current design is given by

∇J(u_9) = ∫_X ∇_1 j(u_9, x) µ(dx).

In Figure 1, the situation is shown for the case d_r = d_o = 1. How can we use the samples ∇_1 j(u_i, x_i)_{i=1,...,9} in an optimal fashion to build an approximation to ∇J(u_9)?
We will present four different methods of calculating the integration weights (α i ) i=1,...,n , each with their own benefits and downsides.

Exact Weights
To approximate the value of the integral along the bold line, we may use a nearest neighbor approximation. The underlying idea is visualized in Figure 2. Thus, we define the sets

M_k := { x ∈ X : ‖(u_n, x) − (u_k, x_k)‖_{U×X} ≤ ‖(u_n, x) − (u_j, x_j)‖_{U×X} for all j = 1, ..., n }.

By construction, M_k contains all points x ∈ X such that (u_n, x) is closer to (u_k, x_k) than to any other previous point at which we evaluated ∇_1 j. The full gradient is then approximated in a piecewise constant fashion by

Ĝ_n = ∑_{k=1}^n µ(M_k) ∇_1 j(u_k, x_k).

Fig. 2 For the exact integration weights, α_i represents the measure of the line segment that lies in the Voronoi cell around (u_i, x_i). The grey cells correspond to samples with integration weight 0.
Therefore, we call α_k^{ex} = µ(M_k) exact integration weights, since they are based on an exact nearest neighbor approximation. These weights were first introduced in [1] and offer the best approximation to the full gradient. However, the calculation of the exact integration weights requires full knowledge of the measure µ and is based on a Voronoi tessellation [28], which is computationally expensive for high dimensions of U × X.
Note that, in the special case dim(X) = 1, the calculation of the tessellation can be circumvented, regardless of dim(U). Instead, the intersection points of the line {u_n} × X and all faces of active Voronoi cells can be obtained directly by solving the equations appearing in the definition of M_k. This, however, still requires us to solve O(n^2) quadratic equations per CSG iteration.
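For dim(X) = 1 and known µ, the exact weights can also be approximated numerically by discretizing the line {u_n} × X. The following sketch (with µ = U(0, 1) and hypothetical stored samples) replaces the Voronoi tessellation by a fine grid:

```python
import math

# Grid-based approximation of the exact weights alpha_k = mu(M_k) for
# dim(X) = 1 and mu = U(0, 1): each grid point x on the line {u_n} x X
# is assigned to the nearest stored pair (u_k, x_k).
def exact_weights(u_n, samples, grid_size=20_000):
    n = len(samples)
    counts = [0] * n
    for i in range(grid_size):
        x = (i + 0.5) / grid_size
        k = min(range(n),
                key=lambda k: math.hypot(u_n - samples[k][0], x - samples[k][1]))
        counts[k] += 1
    return [c / grid_size for c in counts]

# Hypothetical stored samples (u_k, x_k); the first lies far from the
# current design u_n = 0.5 and therefore receives weight 0.
samples = [(0.9, 0.2), (0.5, 0.7), (0.52, 0.3)]
w = exact_weights(0.5, samples)
```

The weights sum to 1 by construction, and samples taken at designs far from u_n end up with empty cells, i.e., weight 0, matching the grey cells in Figure 2.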

Exact Hybrid Weights
In some settings, the dimension of X can be very small compared to the dimension of U. Hence, we might avoid computing a Voronoi tessellation in U × X by treating these spaces separately. For this, we introduce the sets

M̃_i := { x ∈ X : ‖x − x_i‖_X ≤ ‖x − x_j‖_X for all j = 1, ..., n },

i.e., the Voronoi cells of the samples x_i in X alone. Denoting by 1_{M_k} the indicator function of M_k, we define the exact hybrid weights as

α_k^{eh} = ∑_{i=1}^n µ(M̃_i) 1_{M_k}(x_i).

Fig. 3 The red stars on the bold line represent the points (u_n, x_i). For each cell, we determine the number of these points inside the cell and combine their corresponding measures µ(M̃_i), indicated by the colored line segments. Since dim(X) = 1 in the shown example, the sets M̃_i are simply intervals, where the endpoints are given by the midpoints of two neighboring x_i.
Notice that the sets M_k still appear in the definition of the exact hybrid weights, but do not need to be calculated explicitly. Instead, we only have to find the nearest sample point to (u_n, x_i) for each i = 1, ..., n, which can be done efficiently even in large dimensions. We do, however, still require knowledge of µ. The idea of the exact hybrid weights is captured in Figure 3.

Inexact Hybrid Weights
To calculate the integration weights in the case that the measure µ is unknown, we may approximate µ(M̃_i) empirically. All that is required for this approach is additional samples of the random variable X. To be precise, we draw enough samples such that in iteration n, we have in total f(n) samples of X, where f : N → N is a function which is strictly increasing and satisfies f(n) ≥ n for all n ∈ N. It is important to note that we still evaluate ∇_1 j(u_n, ·) at only one of these points, which we denote by x_{j_n}. Thus, exchanging µ(M̃_i) for its empirical approximation #{ l ∈ {1, ..., f(n)} : x_l ∈ M̃_i } / f(n) yields the inexact hybrid weights. How fast f grows determines not only the quality of the approximation, but also the computational complexity of the weight calculation. Based on the choice of f, the inexact hybrid weights interpolate between the exact hybrid weights and the empirical weights, which will be introduced below. In Figure 5, this behavior is shown for functions of the form f(n) = n^β with β ≥ 1. Figure 4 illustrates the general concept of the inexact hybrid weights.

Empirical Weights
The empirical weights offer a computationally cheap alternative to the methods mentioned above. Their calculation does not require any knowledge of the measure µ. For the empirical weights, the quantity µ(M_k) is directly approximated by its empirical counterpart, i.e.,

α_k^{e} = #{ i ∈ {1, ..., n} : x_i ∈ M_k } / n.

The corresponding picture is shown in Figure 6. By iteratively storing the distances ‖x_i − x_j‖_X, i, j < n, the empirical weights can be calculated very efficiently.

Fig. 4 The red stars on the bold line represent the points (u_n, x_i)_{i=1,...,f(n)}. Similar to the exact hybrid weights, for each cell of the Voronoi diagram, we count the points (u_n, x_{j_i})_{i=1,...,n} inside the cell (marked by dashed lines). Instead of assigning them the weights µ(M̃_{j_i}), we weight them by an empirical approximation to this quantity, i.e., by the percentage of random samples lying in the corresponding cell.

Fig. 5 Median of the results for problem (19), initialized with 1000 random starting points, for empirical weights, exact hybrid weights and several inexact hybrid weights. The function f appearing in the inexact hybrid weights was chosen as f(n) = n^β.
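The empirical weight rule can be sketched in a few lines. The generalized L^1-type metric, the shrinking design history and the sample values below are illustrative choices:

```python
import random

# Empirical weights: mu(M_k) is approximated by the fraction of the n
# drawn x-samples that fall into the cell M_k, i.e., whose nearest stored
# pair (u_k, x_k) -- measured from the current design u_n -- is sample k.
# No knowledge of mu is required.
def empirical_weights(u_n, samples):
    n = len(samples)
    counts = [0] * n
    for (_, x) in samples:
        k = min(range(n),
                key=lambda k: abs(u_n - samples[k][0]) + abs(x - samples[k][1]))
        counts[k] += 1
    return [c / n for c in counts]

rng = random.Random(5)
# Hypothetical history: designs u_k = 1/(k+1) settling toward 0, x ~ U(0, 1).
samples = [(1.0 / (k + 1), rng.random()) for k in range(50)]
w = empirical_weights(samples[-1][0], samples)
```

The weights form a probability vector, and most of the mass ends up on samples whose designs are close to the current one, as the nearest-neighbor rule intends.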

Metric on U × X
As mentioned after Assumption 2.4, the convergence results proven in this contribution are independent of the specific inner norms on U and X, denoted by ‖·‖_U and ‖·‖_X. Furthermore, they also hold if we substitute the outer norm on U × X appearing in (2), e.g., by a generalized L^1-outer norm, i.e.,

‖(u, x)‖_{U×X} = c_1 ‖u‖_U + c_2 ‖x‖_X

with c_1, c_2 > 0. While this does not seem particularly helpful at first glance, it in fact allows to drastically change the practical performance of CSG. By altering the ratio ξ = c_2/c_1, the CSG gradient approximation tends to consider fewer (ξ < 1) or more (ξ > 1) old samples in the linear combination. The effect such a choice can have in practice is visible in the numerical analysis of CSG in [26].
To be precise, choosing ξ ≪ 1 results in the nearest neighbor being predominantly determined by the distance in design. In the extreme case, this means that CSG initially behaves more like a traditional SG algorithm. Analogously, ξ ≫ 1 will initially yield a gradient approximation in which all old samples are used, even if they differ greatly in design (for a discrete measure µ, this corresponds to SAG). The corresponding Voronoi diagrams are shown in Figure 7.

Fig. 6 The empirical weights assign each line segment to its empirical measure, i.e., the fraction of points (u_n, x_i) (red stars) which lie inside the corresponding cell.

Fig. 7 Voronoi diagrams for the integration weight calculation, given the sample points (u_i, x_i)_{i=1,...,9}. Different ratios ξ in the norm on U × X lead to SG-like behavior (left, ξ = 0.01) or averaging over all samples (right, ξ = 100).
For the sake of consistency, all numerical examples will use the Euclidean norm ‖·‖_2 as inner norms, and we fix ξ = 1.

Properties of the Integration Weights
The integration weights gained by each of the methods mentioned above can be regarded as a special probability measure µ_n on the space X, which we use to approximate the full integral. For example, the empirical weights correspond to the empirical measure (see, e.g., [29])

µ_n^{e} = (1/n) ∑_{i=1}^n δ_{x_i},

where the x_i are as in Assumption 2.4. It was shown in [30, Theorem 3] that µ_n^{e} ⇒ µ for n → ∞, where ⇒ denotes the weak convergence of measures (or, equivalently, the weak-* convergence in dual space theory, see, e.g., [31, Section 7.3]), i.e., ∫_X φ dµ_n → ∫_X φ dµ for all bounded continuous φ : X → R. Likewise, the measures associated with the other integration weights satisfy µ_n^{eh} ⇒ µ and µ_n^{ih} ⇒ µ as well. This property plays an important role in the proof of Lemma 4.6, as it ensures a vanishing Wasserstein distance of µ_n and µ for n → ∞, i.e., d_W(µ_n, µ) → 0, for all of the presented integration weights, see [32, Theorem 6].

Batches, Patches and Parallelization
All methods for obtaining the integration weights presented above rely heavily on evaluating the gradient ∇_1 j only at a single sample point (u_n, x_n) per iteration. In other words, Algorithm 1 has a natural batch size of 1. While we certainly do not want to increase this number too much, stochastic mini-batch algorithms outperform classic SG in some instances, especially if the evaluation of the gradient samples ∇_1 j(u_n, x_n^{(i)}), i = 1, ..., N, can be done in parallel.
Increasing the batch size N in CSG raises the question of how the integration weights should be obtained. We could, of course, simply collect all evaluation points and calculate the weights as usual. This, however, may significantly increase the computational cost, effectively scaling it by N. In many instances, there is a much more elegant solution to this problem, which lets us include mini-batches without any additional weight calculation cost.
Assume, for simplicity, that X is an open interval and that µ corresponds to a uniform distribution, i.e., there exist a < b ∈ R such that X = (a, b) and µ(dx) = (b − a)^{-1} dx. Given N ∈ N, any integral over X can then be split into N integrals over the patches X_i = (a + (i−1)(b−a)/N, a + i(b−a)/N), i = 1, ..., N. Thus, we can include mini-batches of size N ∈ N into CSG by performing the following steps in each iteration:

1.) Draw a random sample x_n ∈ (a, a + (b−a)/N).
2.) Evaluate ∇_1 j(u_n, ·) at each translated point x_n^{(i)} = x_n + (i−1)(b−a)/N, i = 1, ..., N.
3.) Compute the averaged gradient sample g_n = (1/N) ∑_{i=1}^N ∇_1 j(u_n, x_n^{(i)}).
4.) Calculate integration weights (α_k)_{k=1,...,n} as usual for (u_k, x_k)_{k=1,...,n} and set Ĝ_n = ∑_{k=1}^n α_k g_k.

It is straightforward to apply this idea to higher-dimensional rectangular cuboids. Furthermore, the process of subdividing X into smaller patches and drawing the samples in only one of these regions X_1 ⊂ X can be generalized to more complex settings as well. However, it is necessary that translating the sample x_n into the other patches preserves the underlying probability distribution X ∼ µ, e.g., if µ is induced by a Lipschitz continuous density function and the different patches are obtained by reflecting, translating, rotating and scaling X_1. A conceptual example is shown in Figure 8.
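The patch construction for the uniform case can be sketched in a few lines; the values of a, b and N are illustrative:

```python
import random

# Patch-based mini-batch for mu = U(a, b): one draw in the first patch
# [a, a + (b-a)/N) is translated into all N patches, so a batch of N
# samples costs only a single random draw and one weight calculation.
def patch_batch(a, b, N, rng):
    h = (b - a) / N
    x = a + h * rng.random()       # single random draw in patch X_1
    return [x + i * h for i in range(N)]

rng = random.Random(7)
batch = patch_batch(0.0, 1.0, 4, rng)
# Each batch element is uniformly distributed on its own patch, so the
# union of the patches reproduces x ~ U(0, 1) overall.
```

Since the translation is measure-preserving for the uniform distribution, averaging the gradient samples over the batch leaves the expectation of the CSG gradient approximation unchanged.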
The effect of introducing mini-batches into CSG is tested for the optimization problem (4), whose optimal solution is u* = 0 ∈ R^2, by dividing the two-dimensional domain into small squares of sidelength 1/N, achieving a batch size of N^2. The results of 500 optimization runs with random starting points are given in Figure 9. Although increasing the batch size does, as expected, not influence the rate of convergence, it still improves the overall performance and should definitely be considered for complex optimization problems, especially if the gradient sampling can be performed in parallel.
Fig. 8 To obtain a mini-batch of six sample points, the domain X is subdivided into six patches, indicated by the blue lines. In each iteration, a sample point x_n is drawn in patch X_1 (grey) and afterwards translated into all other patches.

Fig. 9 Median distance of the CSG iterates to the optimal solution u* = 0 ∈ R^2 of (4) in each iteration. The batch size used is equal to N^2, where different values of N are indicated by the different colors. For all runs, a constant step size of τ = 1/2 was chosen.

Auxiliary Results
From now on, unless explicitly stated otherwise, we will always assume Assumptions 2.1 to 2.4 to be satisfied. We will later show that the CSG method converges to stationary points, which we define next. Gradient descent methods for L-smooth objective functions have been thoroughly studied in the past (e.g., [33,34]). The key ingredients for obtaining convergence results with constant step sizes are the descent lemma and the characteristic property of the projection operator, which we state in the following. Before we move on to results that are specific to the CSG method, we state a general convergence result, which will be helpful in the later proofs. Next up, we show that there exists N ∈ N such that ‖u_n − ū^{(n)}‖_U < δ_0/4 for all n ≥ N. We prove this by contradiction: assume there exist infinitely many n ∈ N such that ‖u_n − ū^{(n)}‖_U ≥ δ_0/4. This subsequence is again bounded and therefore must have an accumulation point. By construction, this accumulation point is not an accumulation point of (u_n)_{n∈N}, which is a contradiction. Now, let N_1 ∈ N be large enough such that ‖u_n − ū^{(n)}‖_U < δ_0/4 for all n ≥ N_1. By our assumptions, there also exists N_2 ∈ N with ‖u_{n+1} − u_n‖_U < δ_0/4 for all n ≥ N_2. Define N := max{N_1, N_2}.
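The characteristic property of the projection operator can be illustrated directly for a box-shaped admissible set; the box U = [0, 1]^3 and the test points below are illustrative choices:

```python
# Projection onto a box U = [lo, hi]^d is a componentwise clip.  Its
# characteristic property <w - P_U(w), v - P_U(w)> <= 0 for all v in U
# is the ingredient combined with the descent lemma in the proofs.
def project_box(w, lo, hi):
    return [min(max(wi, lo), hi) for wi in w]

def inner(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

w = [1.7, -0.3, 0.4]
p = project_box(w, 0.0, 1.0)
# Check the variational inequality for a few points v in U.
test_points = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 0.2, 0.9]]
checks = [inner([wi - pi for wi, pi in zip(w, p)],
                [vi - pi for vi, pi in zip(v, p)]) for v in test_points]
```

All inner products are nonpositive, which is exactly the variational characterization of the orthogonal projection onto a convex set.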

Results for CSG Approximations
From now on, let (u_n)_{n∈N} denote the sequence of iterates generated by Algorithm 1. In this section, we want to show that the CSG approximations Ĵ_n and Ĝ_n in the course of the iterations approach the values of J(u_n) and ∇J(u_n), respectively. This is a key result for the convergence theorems stated in Section 5 and Section 6.

Proof Utilizing the compactness of supp(µ) ⊂ R^{d_r}, there exists T ∈ N such that (B_{ε/2}(m_i))_{i=1,...,T} is an open cover of supp(µ) consisting of balls with radius ε/2 centered at points m_i ∈ supp(µ). Thus, for each x ∈ supp(µ) we can find i_x ∈ {1, ..., T} with x ∈ B_{ε/2}(m_{i_x}). Hence, by Remark 2.1, for each i = 1, ..., T there exists a corresponding index set. For i ∈ I, let (u_{n_t^{(i)}})_{t∈N} be the subsequence consisting of all elements of (u_n)_{n∈N} that lie in B_{ε/4}(m_i). Observe that (x_{n_t^{(i)}})_{t∈N} is independent and identically distributed according to µ, since the corresponding identity holds for all A ⊆ supp(µ) and each i ∈ I. Thus, by Remark 2.1, (x_{n_t^{(i)}})_{t∈N} is dense in supp(µ) with probability 1 for all i ∈ I. By Lemma 4.4, we can find suitable K_i. Define N_1 and N_2 accordingly, as well as N := max(N_1, N_2). Notice that this definition of N implies, for all i ∈ I and all n > N, that {n_t^{(i)} : t ∈ {1, ..., K_i}} ⊆ {1, ..., n}.

By (5), for all n > N there exists i ∈ I such that ‖u_n − m_i‖_U < ε/4. Now, given x ∈ supp(µ) and n > N, we choose j ∈ I such that u_n ∈ B_{ε/4}(m_j). Thus, the claimed quantity is bounded by ε, where we used (6) and u_n, u_{n_t^{(j)}} ∈ B_{ε/4}(m_j) in the last line.

Proof Denote by ν_n the measure corresponding to the integration weights according to (3). For x ∈ supp(µ), we define the closest index in the current iteration, k_n(x), as in (2), where L_j is the Lipschitz constant of ∇_1 j as defined in Assumption 2.3. First, since Z_n is uniformly (in n) Lipschitz continuous, we obtain a bound involving L_Z d_W(ν_n, µ). Here, L_Z corresponds to the Lipschitz constant of Z_n and d_W denotes the Wasserstein distance of ν_n and µ. By Assumption 2.2, X is bounded and, by Assumption 2.4, we have ν_n ⇒ µ. Thus, [32, Theorem 6] yields d_W(ν_n, µ) → 0. Additionally, since Z_n is bounded and converges pointwise to 0 (see Lemma 4.5), we use Lebesgue's dominated convergence theorem and conclude the first claim. For the second part, let Q_n be an arbitrary coupling of ν_n and µ, i.e., Q_n(· × X) = ν_n and Q_n(X × ·) = µ. Utilizing the Lipschitz continuity of ∇_1 j (Assumption 2.3) once again, we obtain a bound in terms of ∫ ‖x − y‖_X Q_n(d(x, y)).

Denote the set of all couplings of ν_n and µ by Q. Since Q_n was arbitrary, the bound holds with the infimum over Q, finishing the proof of ‖Ĝ_n − ∇J(u_n)‖_U → 0. The second part of the claim follows analogously.
As a final remark before starting the convergence analysis, we want to give further details on the class of problems that can be solved by the CSG algorithm.
Remark 4.1 (Generalized Setting) Suppose that, in addition to U, X and J as defined in the introduction, we are given a convex set V ⊂ R^{d_1} for some d_1 ∈ N and a continuously differentiable function F : V × R → R. Now, if we consider the optimization problem

min_{(u,v) ∈ U×V} F(v, J(u)),   (7)

the gradient of the objective function with respect to (u, v) is given by

( ∂_2 F(v, J(u)) ∇J(u), ∇_1 F(v, J(u)) ).
Thus, we can use the CSG method to solve (7), and all our convergence results carry over to this setting, as long as the new objective function satisfies Assumption 2.3. Furthermore, let Y ⊂ R^{d_2} for some d_2 ∈ N. Assume that we are given a probability measure ν such that the pair (Y, ν) satisfies the same assumptions we imposed on (X, µ), and consider an optimization problem whose objective function additionally involves an expectation with respect to ν. Again, the gradient of this objective function can be approximated by the CSG method, provided F satisfies the corresponding regularity assumptions. It is clear that we can continue to wrap functions or expectation values around each other in these fashions indefinitely. Therefore, we see that the scope of the CSG method is far larger than problems like (1) and includes many settings which stochastic gradient descent methods cannot handle, like nested expected values, tracking of expected values and many more.

Example for a Composite Objective Function
To study the performance of CSG in the generalized setting, we consider an optimization problem in which the objective function is not of type (1), but instead falls into the broader class of settings described in Remark 4.1. The optimal solution u* = π²/2 to (9) can be found analytically. The nonlinear fashion in which the inner integral over X enters the objective function prohibits the use of SG-type methods for (9). There is, however, the possibility to use the stochastic compositional gradient descent method (SCGD), which was proposed in [19] and is specifically designed for optimization problems of the form (9). Each SCGD iteration consists of two main steps: the inner integral is approximated by samples via an iterative update with a slowly diminishing step size; this approximation is then used to carry out a stochastic gradient descent step with a quickly diminishing step size.
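The two-timescale structure of SCGD described above can be sketched as follows. The composite toy problem F(y) = y², g(u, x) = u − x with X ~ Uniform[0, 2] (so u* = E[X] = 1) and the step-size exponents are illustrative choices loosely following the rates in [19]; this is not the example (9) from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def scgd(u0, n_steps):
    """Two-timescale SCGD sketch for min_u F(E[g(u, X)]) with the toy choices
    F(y) = y^2, g(u, x) = u - x, X ~ Uniform[0, 2], so u* = E[X] = 1.
    y_n tracks the inner expectation with a slowly diminishing step b_n,
    while u_n takes stochastic gradient steps with a fast-diminishing a_n."""
    u, y = u0, 0.0
    for n in range(n_steps):
        a = 0.5 / (n + 1) ** 0.75       # fast step size (outer update)
        b = 1.0 / (n + 1) ** 0.5        # slow step size (inner tracker)
        x = rng.uniform(0.0, 2.0)
        y = (1.0 - b) * y + b * (u - x)  # running estimate of E[g(u, X)]
        u = u - a * 2.0 * y              # chain rule: (dg/du) * F'(y)
    return u
```

The inner tracker y_n is the distinctive ingredient: it averages out the sampling noise of the inner expectation before the outer gradient step is taken.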
For numerical comparison, we choose 1000 random starting points in [11/2, 19/2], i.e., the right half of U. In our tests, the accelerated SCGD method (see [19]) performed better than basic SCGD, mainly because the objective function of (9) is strongly convex in a neighborhood of u*. Thus, we compare the results obtained by CSG to the aSCGD algorithm, for which we chose the optimal step sizes according to [19, Theorem 7]. For CSG, we chose a constant step size τ = 1/30, which represents a rough approximation to the inverse of the Lipschitz constant L. The results are given in Figure 10.
Furthermore, we are interested in the number of steps each method has to perform such that the distance to u* lies (and stays) within a given tolerance. Thus, we also analyzed the number of steps after which the different methods obtain a result within a tolerance of 10^{-1} in at least 90% of all runs. The results are shown in Figure 11.
Fig. 10 Absolute error ‖u_n − u*‖ during the optimization process. From top to bottom: aSCGD (red), CSG with empirical weights (cyan), CSG with inexact hybrid weights with f(n) = n^{1.5} (green) and CSG with exact hybrid weights (blue). The shaded areas indicate the quantiles P_{0.1,0.9} (light) and P_{0.25,0.75} (dark). Fig. 11 Minimum number of steps needed for the different algorithms such that at least 90% of the runs achieve an absolute error smaller than the given tolerance. The colors are chosen in the same fashion as in Figure 10. The exact numbers for a tolerance of 10^{-1} are 42 (exact hybrid), 76 (inexact hybrid), 440 (empirical) and 1382 (aSCGD).

Convergence Results For Constant Step Size
Our first result considers the special case in which the objective function J appearing in (1) has only finitely many stationary points on U. The proof of this result serves as a prototype for the later convergence results, as they share a common idea. Under this assumption, CSG with a positive constant step size τ_n = τ < 2/L converges to a stationary point of J.
We want to sketch the proof of Theorem 5.1 before going into details. In the deterministic case, Lemma 4.1 and Lemma 4.2 are used to show that J(u_{n+1}) ≤ J(u_n) for all n ∈ N. It then follows from a telescoping sum argument that ‖u_{n+1} − u_n‖_U → 0, i.e., every accumulation point of (u_n)_{n∈N} is stationary (compare [34, Theorem 5.1] or [35, Theorem 10.15]).
In the case of CSG, we cannot guarantee monotonicity of the objective function values (J(u_n))_{n∈N}. Instead, we split the sequence into two subsequences. On one of these subsequences, we can guarantee a decrease in function values, while for the other we cannot. However, we prove that the latter subsequence can accumulate only at stationary points of J. The main idea is then that (u_n)_{n∈N} can have only one accumulation point, because "switching" between several points conflicts with the decrease in function values that must occur for the steps in between.
Proof of Theorem 5.1 By Lemma 4.1 we obtain a first estimate. Utilizing Lemma 4.2 and the Cauchy-Schwarz inequality, we now obtain (10). Since L/2 − 1/τ < 0, our idea is the following: steps that satisfy (11), i.e., steps with small errors in the gradient approximation, will yield decreasing function values. On the other hand, the remaining steps will satisfy ‖u_{n+1} − u_n‖_U → 0, due to ‖∇J(u_n) − Ĝ_n‖_U → 0 (see Lemma 4.6). With this in mind, we distinguish three cases: in Case 1, (11) is satisfied for almost all steps, while in Case 2 it is satisfied for only finitely many steps. In the last case, there are infinitely many steps satisfying and infinitely many steps violating (11).

In the first case, it follows from (10) that J(u_{n+1}) ≤ J(u_n) for all n ≥ N. Therefore, the sequence (J(u_n))_{n∈N} is monotonically decreasing for almost every n ∈ N. Since J is continuous and U is compact, J is bounded and we therefore have J(u_n) → J̄ ∈ R. Since L/2 − 1/τ < 0, we must have ‖u_{n+1} − u_n‖_U → 0. Let (u_{n_k})_{k∈N} be a convergent subsequence with u_{n_k} → ū ∈ U. By Lemma 4.6, we have Ĝ_{n_k} → ∇J(ū), i.e., every accumulation point of (u_n)_{n∈N} is stationary. Since J has only finitely many stationary points, Lemma 4.3 yields the convergence of (u_n)_{n∈N} to a stationary point of J.

In the last case, we split (u_n)_{n∈N} disjointly into the two sequences (u_{a(n)})_{n∈N} and (u_{b(n)})_{n∈N}, such that the steps u_{a(n)} satisfy (11) while the steps u_{b(n)} violate it. We call (u_{a(n)})_{n∈N} the sequence of descent steps. For (u_{b(n)})_{n∈N}, observe that, as in Case 2, we directly obtain that every accumulation point of (u_{b(n)})_{n∈N} is stationary. Therefore, as in the proof of Lemma 4.3, for all ε > 0 there exists N ∈ N such that min_{i∈{1,...,K}} ‖u_{b(n)} − u^i‖_U < ε/4 for all n ≥ N, where u^1, ..., u^K denote the K ∈ N accumulation points of (u_{b(n)})_{n∈N}. Now, we prove by contradiction that J(u^1) = J(u^2) = ...
= J(u^K). Suppose that this is not the case. Then we have at least M ≥ 2 distinct function values of accumulation points, where c_L denotes the Lipschitz constant of J. By (12) and (13), there exists N ∈ N such that the corresponding estimates hold for all n ≥ N and all i ∈ {1, ..., K}. It follows from (14) and (16) that statements (A) and (B) hold for n ≥ N + 1. Observe that (A) follows directly from (14) and (16), as moving from the vicinity of u^i to a neighborhood of u^j requires an intermediate step u_n with min_{i∈{1,...,K}} ‖u_{b(n)} − u^i‖_U ≥ ε/4. Similarly, (B) is just the second condition in (16) reformulated. Now, let i ∈ {1, ..., K} be chosen such that J(u^i) ≤ f_2 and let n_0 ≥ N be chosen such that u_{b(n_0)} ∈ B_{ε/4}(u^i) and u_{b(n_0)+1} ∈ B_{ε/4}(u^i). Using (15) and (17), we obtain that descent steps can never reach B_{ε/4}(u^i) for n ≥ b(n_0) + 1, in contradiction to J^{-1}({f_1}) ∩ {u^1, ..., u^K} containing at least one accumulation point of (u_n)_{n∈N}. Hence, the claim follows.

Remark 5.1 Comparing Theorem 5.1 and Theorem 5.2, observe that under the weaker assumptions on the set of stationary points of J, we no longer obtain convergence of the whole sequence of iterates. To illustrate why this is the case, consider the function J: R^{d_o} → R given by J(u) = cos(π‖u‖_2^2) and U = {u ∈ R^{d_o}: ‖u‖_2^2 ≤ 3/2}. Then S(J) = {0} ∪ {u ∈ U: ‖u‖_2 = 1}, i.e., every point on the unit sphere is stationary. Thus, we cannot use Lemma 4.3 at the end of the proof to obtain convergence of (u_n)_{n∈N}. Theoretically, it might happen that the iterates (u_n)_{n∈N} cycle around the unit sphere, producing infinitely many accumulation points, all of which share the same objective function value. This, however, did not occur when testing this example numerically.
Remark 5.2 While the assumption in Theorem 5.2 seems unwieldy at first, there is actually a rich theory concerning such properties. For example, Sard's theorem [36] and its generalizations [37] yield that the assumption holds if J ∈ C^{d_o} and U has smooth boundary. Even though it can be shown that there exist functions which do not satisfy the assumption (e.g., [38,39]), such counterexamples need to be constructed very carefully and are unlikely to appear in any application.

(R2) Every accumulation point of (u_{a(n)})_{n∈N} is stationary.
Assume first that (R1) does not hold. Then there exist two stationary points u^1 ≠ u^2 with J(u^1) < J(u^2). Now, (A) and (B), shown in the proof of Theorem 5.1, yield that there must exist an accumulation point u^3 of (u_{b(n)})_{n∈N}, i.e., a stationary point, with J(u^1) < J(u^3) < J(u^2). Iterating this procedure, we conclude that the set N ∩ [J(u^1), J(u^2)] is dense in [J(u^1), J(u^2)]. By continuity of u ↦ ‖P_U(u − τ∇J(u)) − u‖ and compactness of U, we see that N ∩ [J(u^1), J(u^2)] = [J(u^1), J(u^2)], contradicting our assumption that λ(N) = 0. For (R2), assume that (u_{a(n)})_{n∈N} has a non-stationary accumulation point ū. Since S(J) is compact, it holds that dist({ū}, S(J)) > 0.
Thus, by the same arguments as in Cases 3.1, 3.2 and 3.3 within the proof of Theorem 5.1, we observe that such a point ū cannot exist.

It is easy to see that (19) has a unique solution u* = 0. Furthermore, the objective function is L-smooth (with Lipschitz constant 1) and even strictly convex. Thus, by Theorem 5.1, the CSG method with a constant positive step size τ < 2 produces a sequence (u_n)_{n∈N} that satisfies u_n → 0. However, even in this highly regular setting, the commonly used basic SG method does not guarantee convergence of the iterates for a constant step size.

Academic Example for Constant Step Size
To demonstrate this behavior of both CSG and SG, we draw 2000 random starting points u 0 ∈ U and compare the iterates produced by CSG and SG with five different constant step sizes (τ ∈ {0.01, 0.1, 1, 1.9, 1.99}).The CSG integration weights were calculated using the empirical method, i.e., the computationally cheapest choice.The results are shown in Figure 12.
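The qualitative difference between the two methods can be reproduced on a simple stand-in problem (not the paper's example (19)): for J(u) = E[(u − X)²]/2 with X ~ N(0, 1), SG with a constant step keeps injecting fresh sampling noise, while a CSG-like update that reuses all past samples has a vanishing gradient error. A minimal sketch, with all names illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def compare_sg_csg(n_steps=2000, tau=0.5, u0=3.0):
    """Toy comparison on J(u) = E[(u - X)^2]/2, X ~ N(0, 1), so u* = 0 and
    the stochastic gradient sample is u - x.  SG uses only the newest sample
    and hovers in a noise-dominated neighborhood of u*; the CSG-like update
    reuses all past samples, so its gradient error vanishes as n grows."""
    u_sg = u_csg = u0
    x_sum = 0.0
    for n in range(1, n_steps + 1):
        x = rng.standard_normal()
        x_sum += x
        u_sg -= tau * (u_sg - x)            # SG: single fresh sample
        u_csg -= tau * (u_csg - x_sum / n)  # CSG-like: averaged sample info
    return u_sg, u_csg
```

With a fixed constant step, the SG iterate keeps fluctuating on the scale of the step size, while the averaged variant settles down, mirroring the behavior reported in Figure 12.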
As expected, the iterates produced by the SG method do not converge to the optimal solution, but instead remain in a neighborhood of u * .The radius of said neighborhood depends on the choice of τ and decreases for smaller τ , see [40,Theorem 4.6].

Backtracking
As ‖Ĝ_n − ∇J(u_n)‖_U → 0 and |Ĵ_n − J(u_n)| → 0 for n → ∞, we can use these approximations to refine the step length by a backtracking line search. Furthermore, given n gradient samples ∇_1 j(u_i, x_i) and n cost function samples j(u_i, x_i), by calculating the weights α_i^{(n)}, each iteration starts with a prescribed step size η_n > 0, based on which the actual step size τ_n is computed (see Algorithm 2). We want to test the stability of both methods with respect to the initially chosen step length. For this purpose, we set η_n = τ_0/n^d, where τ_0 > 0, n is the iteration count and d ∈ [0, 1] is fixed.
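The Armijo-type acceptance test behind such a backtracking line search can be sketched as follows; here J and grad_u stand in for the CSG approximations Ĵ_n and Ĝ_n, and the parameter values are illustrative defaults, not those of Algorithm 2.

```python
import numpy as np

def armijo_backtracking(J, grad_u, u, eta0, c=1e-4, beta=0.5, max_tries=30):
    """Armijo-type backtracking sketch: shrink the trial step eta until the
    sufficient-decrease condition holds.  J and grad_u play the role of the
    CSG approximations Jhat_n and Ghat_n; the interface is illustrative."""
    g = np.asarray(grad_u, dtype=float)
    u = np.asarray(u, dtype=float)
    gg = float(g @ g)                # squared norm of the approximate gradient
    eta = eta0
    for _ in range(max_tries):
        # accept eta once J decreases by at least c * eta * ||g||^2
        if J(u - eta * g) <= J(u) - c * eta * gg:
            return eta
        eta *= beta                  # otherwise shrink the step and retry
    return eta
```

Since the test only evaluates the (cheap) approximations Ĵ_n and Ĝ_n rather than fresh samples, the extra cost per iteration stays small.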
For each combination of τ_0 and d, we choose 1200 random starting points in U and perform 500 optimization steps with both AdaGrad and backtracking CSG. Again, the integration weight calculation in bCSG was carried out using the empirical method, leading to a faster weight calculation at the cost of less progress per iteration. The median of the absolute error ‖u_500 − u*‖ after the optimization, depending on d and τ_0, is presented in Figure 13.
While there are a few instances where AdaGrad yields a better result than backtracking CSG, we observe that the performance of AdaGrad changes rapidly, especially with respect to the parameter d. The backtracking CSG method, on the other hand, performs better in most cases and is much less dependent on the choice of parameters.

Estimations for the Lipschitz Constant of ∇J
We have already seen that the Lipschitz constant L of ∇J is closely connected to efficient bounds on the step sizes. However, in general, we cannot expect any a priori knowledge of L. Thus, we are interested in an approximation of L that can be calculated during the optimization process.
Investigating the proof of Lemma 4.1 in [35] yields a second-order expansion of J(u^1) around J(u^2). To motivate our approach, assume that J is twice continuously differentiable. In this case, a possible approximation to the constant C_n in iteration n is ‖∇²J(u_n)‖. Therefore, utilizing the previous gradient approximations, we obtain an estimate for C_n. Then C_n^{-1} yields a good initial step size for our backtracking line search. To circumvent strong oscillation of C_n, which may arise from the approximation errors of the terms involved, we project C_n onto the interval [C_min, C_max] ⊂ R, where 0 < C_min < C_max < 2^{T+1} L, i.e., we use the projected value (21). If possible, C_min and C_max should be chosen according to available information on L. However, tight bounds on these quantities are not needed, as long as T is chosen large enough. The resulting SCIBL-CSG (SCaling Independent Backtracking Line search) method is presented in Algorithm 4. Notice that SCIBL-CSG does not require any a priori choice of step sizes and yields the same convergence results as bCSG.
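A minimal sketch of such a curvature estimate, assuming a secant-type quotient of successive gradient approximations is used and projecting the result onto [C_min, C_max]; the function name and interface are illustrative, not the exact formula (21):

```python
import numpy as np

def lipschitz_estimate(u_prev, u_cur, g_prev, g_cur, c_min, c_max):
    """Secant-type estimate of the local Lipschitz constant of the gradient,
    C_n ~ ||Ghat_n - Ghat_{n-1}|| / ||u_n - u_{n-1}||, projected onto the
    interval [c_min, c_max] to suppress oscillation (sketch of the idea)."""
    du = np.linalg.norm(np.atleast_1d(u_cur) - np.atleast_1d(u_prev))
    dg = np.linalg.norm(np.atleast_1d(g_cur) - np.atleast_1d(g_prev))
    c = dg / du if du > 0 else c_min   # fall back if the iterate did not move
    return min(max(c, c_min), c_max)   # projection onto [c_min, c_max]
```

The reciprocal of the returned value then serves as the initial trial step for the backtracking line search.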

Conclusion and Outlook
In this contribution, we provided a detailed convergence analysis of the CSG method. The calculation of the integration weights was enhanced by several new approaches, which were discussed and generalized to allow for mini-batch implementations.
We provided a convergence proof for the CSG method when carried out with a small enough constant step size.Additionally, it was shown that CSG can be augmented by an Armijo-type backtracking line search, based on the gradient and objective function approximations generated by CSG in the course of the iterations.The resulting bCSG scheme was proven to converge under mild assumptions and was shown to yield stable results for a large spectrum of hyperparameters.Lastly, we combined a heuristic approach for approximating the Lipschitz constant of the gradient with bCSG to obtain a method that requires no a priori step size rule and almost no information about the optimization problem.
For all CSG variants, the stated convergence results are similar to those for full gradient schemes, i.e., every accumulation point of the sequence of iterates is stationary and we have convergence of the objective function values. Furthermore, as is the case for full gradient methods, if the optimization problem has only finitely many stationary points, the presented CSG variants produce a sequence which is guaranteed to converge to one of these stationary points. However, none of the presented convergence results for CSG give any indication of the underlying rate of convergence. Furthermore, while the performance of all proposed CSG variants was tested on academic examples, it is important to analyze how they compare to algorithms from the literature and commercial solvers when used in real-world applications.
Detailed numerical results concerning both of these aspects can be found in [26].

Assumption 2.1 (Sample Sequence) The sequence of samples (x_n)_{n∈N} is a sequence of independent, identically distributed realizations of the random variable X ∼ µ.

Definition 4.1 (Stationary Points) Let J ∈ C¹(U) be given. We say u* ∈ U is a stationary point of J if there exists t > 0 such that P_U(u* − t∇J(u*)) = u*. Furthermore, we denote by S(J) := {u ∈ U : P_U(u − t∇J(u)) = u for some t > 0} the set of all stationary points of J.
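For a box-shaped admissible set, this fixed-point characterization can be checked numerically as follows; the helper names and the choice of a single fixed t are illustrative:

```python
import numpy as np

def project_box(u, lo, hi):
    """Projection P_U onto the box U = [lo, hi]^d."""
    return np.clip(u, lo, hi)

def is_stationary(u, grad_u, lo, hi, t=1.0, tol=1e-12):
    """Check the fixed-point characterization P_U(u - t * grad J(u)) = u
    from Definition 4.1 for a box-shaped U and one fixed t (sketch)."""
    u = np.asarray(u, dtype=float)
    step = project_box(u - t * np.asarray(grad_u, dtype=float), lo, hi)
    return bool(np.linalg.norm(step - u) <= tol)
```

Note that on the boundary of U a point can be stationary even with a nonzero gradient, as the projection absorbs the outward-pointing component.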

Lemma 4.6 (Approximation Results for J and ∇J) The approximation errors for ∇J and J vanish in the course of the iterations, i.e., ‖Ĝ_n − ∇J(u_n)‖_U → 0 and |Ĵ_n − J(u_n)| → 0 for n → ∞.

Theorem 5.1 (Convergence for Constant Steps) Assume that J has only finitely many stationary points on U. Then CSG with a positive constant step size τ_n = τ < 2/L converges to a stationary point of J.
(A) If u_{b(n)} ∈ B_{ε/4}(u^i) and u_{b(n+1)} ∈ B_{ε/4}(u^j) for some j ≠ i, then there must be at least one descent step between u_{b(n)} and u_{b(n+1)}. (B) If u_{b(n)} ∈ B_{ε/4}(u^i) and u_{b(n)−1} ∉ B_{ε/4}(u^i), then u_{b(n)−1} must be a descent step. Observe that (A) follows directly from (14) and (16).

Proof of Theorem 5.2 Proceeding analogously to the proof of Theorem 5.1, we only have to adapt two intermediate results in Case 3: (R1) The objective function values of all accumulation points of (u_{b(n)})_{n∈N} are equal.

Fig. 12 Comparison of the iterates produced by 500 steps of SG (red / first row) and CSG (green / second row) with 2000 random starting points u_0 ∈ U. Both methods were tested for the five different constant step sizes τ ∈ {0.01, 0.1, 1, 1.9, 1.99} (first to fifth column). The shaded areas indicate the quantiles P_{0.1,0.9} (light) and P_{0.25,0.75} (dark), while the solid line represents the median of the 2000 runs.
Calculating the weights α_i^{(n)}(u) with respect to a given point u ∈ U, we define Ĵ_n accordingly.
Algorithm 4 SCIBL-CSG
Given u_0 ∈ U.
while termination condition not met do
    Sample objective function: j_n := j(u_n, x_n)
    Sample gradient: g_n := ∇_1 j(u_n, x_n)
    Calculate weights α_k
    Calculate C_n by (21)
    Calculate step size τ_n by Algorithm 2, started at 1/C_n
    Gradient step: u_{n+1} := P_U(u_n − τ_n Ĝ_n)
    Update index: n = n + 1
end while
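Putting the pieces together, a loose sketch of an Algorithm 4-style loop on a toy problem might look as follows; a plain sample average replaces the CSG integration weights, and all names and constants are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def sketch_scibl_csg(n_steps=200, u0=2.0, lo=-3.0, hi=3.0,
                     c_min=0.1, c_max=10.0):
    """Loose sketch of an SCIBL-CSG-style loop on the toy problem
    J(u) = E[(u - X)^2] / 2 with X ~ N(0, 1), so u* = 0 and L = 1."""
    u, u_prev, G_prev = u0, None, None
    x_hist = []
    C = 1.0
    for n in range(n_steps):
        x_hist.append(rng.standard_normal())         # sample x_n
        m = float(np.mean(x_hist))
        G = u - m                                     # stand-in for Ghat_n
        Jhat = 0.5 * (u - m) ** 2                     # stand-in for Jhat_n
        # secant-type curvature estimate, projected onto [c_min, c_max]:
        if n > 0 and u != u_prev:
            C = min(max(abs(G - G_prev) / abs(u - u_prev), c_min), c_max)
        # backtracking line search started at 1 / C (Algorithm 2 stand-in):
        tau = 1.0 / C
        while tau > 1e-8:
            u_trial = min(max(u - tau * G, lo), hi)   # projection onto U
            if 0.5 * (u_trial - m) ** 2 <= Jhat - 1e-4 * tau * G * G:
                break                                 # sufficient decrease
            tau *= 0.5
        u_prev, G_prev = u, G
        u = min(max(u - tau * G, lo), hi)             # projected gradient step
    return u
```

No step size is prescribed anywhere: the initial trial step comes from the curvature estimate and the final step from the line search, which is exactly the appeal of the SCIBL idea.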