The Continuous Stochastic Gradient Method: Part II -- Application and Numerics

In this contribution, we present a numerical analysis of the continuous stochastic gradient (CSG) method, including applications from topology optimization and convergence rates. In contrast to standard stochastic gradient optimization schemes, CSG does not discard old gradient samples from previous iterations. Instead, design dependent integration weights are calculated to form a linear combination as an approximation to the true gradient at the current design. As the approximation error vanishes in the course of the iterations, CSG represents a hybrid approach, starting off like a purely stochastic method and behaving like a full gradient scheme in the limit. In this work, the efficiency of CSG is demonstrated for practically relevant applications from topology optimization. These settings are characterized by both, a large number of optimization variables \textit{and} an objective function, whose evaluation requires the numerical computation of multiple integrals concatenated in a nonlinear fashion. Such problems could not be solved by any existing optimization method before. Lastly, with regards to convergence rates, first estimates are provided and confirmed with the help of numerical experiments.


Introduction
In this paper, we present a numerical analysis of the Continuous Stochastic Gradient (CSG) method, which was first proposed in [1].Later, in [2], it was shown that the error in the CSG gradient and objective function approximation vanishes during the course of the iterations.This key property of CSG yields strong convergence results known from classic gradient methods, e.g., convergence of the sequence of iterates for constant step sizes, which are beyond the scope of standard stochastic approaches known from literature, like the Stochastic Gradient (SG) method [3], or the Stochastic Average Gradient (SAG) method [4].
Furthermore, the approximation property of CSG significantly increases the set of possible applications, allowing for more complex structures in the optimization problem than the schemes listed before.While CSG was shown to perform superior to various stochastic optimization approaches on academic examples [2], it remains to see if this is also the case for more involved applications.For this purpose, we consider several optimization problems arising in the context of optimal nanoparticle design.These applications focus on optimization with respect to the resulting color of a particulate product, as it represents one of the most prominent fields of research within this setting [5][6][7][8][9][10].
Moreover, all convergence results stated in [2] provide no insight on the rate of convergence.Since this plays a crucial role for the practicability of CSG, it is of great importance to further analyze this quantity.In this contribution, we propose estimated convergence rates for the general CSG method and verify the numerically.

Structure of the Paper
Section 2 introduces the application from nanoparticle optics, mentioned above.Two different methods to model the particle, varying greatly in computational effort and design dimension, are presented.After detailing the setting and challenges in the low-dimensional optimization problem, we compare the results of the CSG method to different approaches based on the fmincon algorithm provided by MATLAB (Section 2.7).Later on, we analyze the highdimensional problem formulation purely within the CSG framework, since a comparison with generic deterministic optimization schemes is out of scope, due to the associated computational complexity.
Afterwards, Section 3 shortly covers techniques to estimate the gradient approximation error during the optimization, before we focus on the convergence rate of CSG in Section 4. While the expected rates stated therein are not proven, we present detailed numerical examples to solidify our claims.Furthermore, we analyze how the convergence rate depends on the dimension of integration and how to avoid slow convergence, if the objective function admits additional structure.

Nanoparticle Design Optimization
Since the design of a nanoparticle, i.e., its shape, size, material distribution, etc., heavily impacts its optical properties, the task of optimizing a nanoparticle design with respect to a specific optical property arises naturally [11].In this section, we are interested in using hematite nanoparticles to optimize the color of a paint film [12].Thus, we start by introducing our main framework for this application.

Color Spaces
First off, we should explain what optimal color means in our setting.There are several different methods to describe color mathematically, e.g., assigning each color an RGB representation vector v ∈ R 3 , where the three components of v correspond to the red, green and blue value of the color.In our application, we are interested in the color of the paint film as it appears to the human eye.Therefore, the underlying color space should be chosen based on the following property: If the euclidean distance between the representation vectors of two colors is small, the colors should be almost indistinguishable to the human eye.
As it turns out, the RGB color space is a very poor choice with respect to this feature.Hence, we instead choose the CIELAB color space [13], which was introduced by the International Commission of Illumination (Commission Internationale de l'Eclairage, CIE), as it was designed with this exact purpose in mind.The CIELAB representation of a color consists of three values L, a and b.Here, L corresponds to the lightness of a color and ranges from 0 (black) to 100 (white).The values of a and b, typically within the range of ±150, describe the colors position with respect to the opponent color pairs green-red and blue-yellow.A short overview is given in Figure 1.
Another color space, which naturally arises from our setting, is the CIE 1931 XYZ color space [14].The values of X, Y and Z can be calculated by integrating the optical properties of a particle over the spectrum of visible light (400nm -700nm), which we denote by Λ.Each of these integrations is weighted by the corresponding color matching functions x, y, z : Λ → R.
Thus, in our application, we will first calculate the CIE 1931 XYZ representation of the resulting color and then use the (nonlinear) color space transformation Ψ : R 3 → R 3 with Ψ(X,Y,Z) = (L, a, b) , to work in the Utilizing the intended CIE parameters = 216 24389 and κ = 24389 27 , the LAB color values are then given by where f : R → R is defined as otherwise .

Mie Theory and Discrete Dipole Approximation
Given a nanoparticle shape and material, we can use the time-harmonic Maxwell's equations to calculate its optical properties.Specifically, in our setting, we are interested in the absorption (Abs), scattering (Sca) and geometry factor (Geo).The time required and precision achieved are, of course, dependent on our model of the nanoparticle and the method used to solve Maxwell's equations.For our setting, we choose two different approaches.
On the one hand, we will use the discrete dipole approximation (DDA) [15][16][17], in which the particle is discretized into an equidistant grid of dipole cells.Thus, DDA allows the analysis of arbitrary particle shapes and material distributions.The downside lies within the computational complexity of the method, which scales with the total number of dipoles and therefore grows rapidly when increasing the resolution.While the CSG method is still capable of solving the resulting optimization problem in our experiments, the tremendous computational cost associated to the DDA approach severely impede a detailed analysis of the problem.Especially, there is no computationally feasible, generic optimization scheme to compare our results with.However, we want to note that optimization in the DDA model has already been done in a slightly simpler setting, where the full integral over Λ was replaced by summation over a small number of different wavelengths [18].
On the other hand, Mie theory [19,20] provides a numerically cheap alternative, at the price of a more restrictive setting.In Mie theory, one only considers radially symmetric particles.In this special setting, it is possible to find analytic solutions based on series expansions to the time-harmonic Maxwell's equations.Therefore, in our first approach, we will only consider core-shell particles, as the utilization of Mie theory allows for a much deeper analysis of the resulting optimization problem and comparison to deterministic optimization approaches, which rely on discretization of the integrals.

Nanoparticles in Paint Film -Kubelka-Munk Theory
As mentioned above, the XYZ color values of the paint film can be calculated by integration of the corresponding color matching functions x, y, z and the important optical properties of the nanoparticle.The precise method to obtain X, Y and Z is given by the Kubelka-Munk theory [21], augmented by a Saunderson correction [22].For a paint film, in which nanoparticles with design u are present and which is illuminated by light with wavelength λ ∈ Λ, the resulting color can be expressed by the K and S value Now, X, Y and Z can be obtained by where ρ 0 and ρ 1 are material parameters.In our setting, which we introduce in the next section, we have ρ 0 = 0.04 and ρ 1 = 0.6.

Problem Formulation
In our first setting, we consider a radially symmetric core-shell nanoparticle (see Figure 2), where the inner core consists of water, while the outer shell is made of hematite.Thus, the design u consists of the radius R (1nm -75nm) of the core and the thickness d (1nm -250nm) of the outer hematite shell.
As an additional layer of difficulty, we can, in practice, not expect all nanoparticles present in the paint film to be identical copies of design u.Instead, when trying to produce nanoparticles of a specific design in large quantities, one usually ends up with a mixture of particles of different designs, following a certain probability distribution µ u , which is dependent on the intended design u.
We model this aspect by assuming that, given a design u = (R, d), the particles present in the paint film follow a normal distribution (truncated to a reasonable design space R × D) centered around u, i.e., R ∼ N (R, 1  10 R) and d ∼ N (d, 1 10 d).
Therefore, the K and S values in the Kubelka-Munk model need to be replaced by their averaged counterparts and before calculating the reflectance R ∞ (u, λ) and integrating it over Λ.The objective in our application is to produce a paint of bright red color.Thus, the complete optimization problem reads max u∈U 1 20 L(u) + 19  20 a(u). (1)

Challenges
The highly condensed fashion, in which (1) is formulated, may obscure a lot of the difficulties that arise when trying to solve it.To get a better understanding of the problem, let us first analyze the abstract structure of the objective function Since calculating J(u) and ∇J(u) requires integrating the optical properties in multiple dimensions and since evaluating said properties for any combination of R, d and λ requires solving the time-harmonic Maxwell's equations, standard deterministic approaches, e.g., full gradient methods, run into a prediscretization problem.
On the one hand, the number of integration points needs to be sufficiently large for our setting.In Figure 3, a slice through the objective function for a fixed value of R and several different amounts of integration points is shown.While we actually do not care too much about the approximation error resulting from a small number of integration points, the artificial local maxima introduced into the objective function by the discretization severely impact the quality of the optimization.In other words, many solutions to the discretized problem are completely unrelated to solutions to (1).We want to note that, even though not all of the stationary points in Figure 3 correspond to stationary points of (1), the prediscretization still leads to very flat regions in the objective functions, which hinder the performance of many solvers.In Figure 4, this effect is displayed.
On the other hand, the number of integration points is heavily restricted by the computational cost associated to the evaluation of Abs, Sca and Geo.While medium resolutions (25 3 ∼ 15000 points in total) are still numerically tractable for simple Mie particles, they are outright impossible to achieve in the more general DDA setting, which we want to consider later.For comparison: The optimization in [18] was carried out using a discretization consisting of 20 points in total.
We want to emphasize that standard SG-type schemes, or even the Stochastic Composition Gradient Descent (SCGD) method [23], which was used for the comparison for composite objective functions in [2, Section 4.2], are not capable of solving (1), due to the special structure of J.    5 Cumulative density function for R in the case R = 80.The six integration points (red dots) are obtained by dividing (0, 1) in six intervals of equal size and calculating the midpoints of the resulting preimages (black crosses).Note that the preimages are first projected on the 3σ-interval.Fig. 6 Density function for R in the case R = 80.The red dots represent the six integration points as detailed in Figure 5.By their special construction, each shaded region under the curve is of equal area.

Discretization
For the reasons mentioned above, we will only compare the results obtained by CSG to generic deterministic optimization schemes for various choices of discretization.Since the integration over Λ admits no special structure, we always choose an equidistant partition for this dimension of integration.However, for the integration over R × D, we can use our knowledge of µ u to achieve a better approximation to the true integral.Instead of dividing R × D into an equidistant grid, we utilize the fact that R and d are normal distributed independent from each other.Since, for a normal distribution, 99.7% of all weight is concentrated in the 3σ-interval around the mean value, we may only discretize this portion of the full domain in each step.
Moreover, we know the precise density function for both R and d.Thus, given a design n not into equidistant intervals, but instead in intervals of equal weight.This procedure is illustrated in Figures 5 and 6 and produces very good results even for a small number of sample points.
However, as we have already seen in Figure 3, even this dedicated discretization scheme introduces additional propbelms into (1).Furthermore, we want to emphasize that choosing a reasonable discretization is a challenge of its own.Not only is there no a priori indication for the general magnitude of the number of points needed, it is also unclear whether or not one should use the same number of points in each direction.

Numerical Results
As mentioned above, the restriction to radially symmetric nanoparticles allows us to apply standard blackbox solvers to (1), in order to have a comparison for the CSG results.In our case, we chose the fmincon implementation of an interior point algorithm, integrated in MATLAB, as is it an easy-to-use blackbox algorithm that yields reproducible results.
Specifically, we compared the results of SCIBL-CSG with empirical weights on R × D and exact hybrid weights on Λ (cf.[2, Section 3]) to the fmincon results for three different discretization schemes of Λ × R × D. Two of these are equal in each dimension (10 × 10 × 10 and 7 × 7 × 7), while the last one is asymmetric (8 × 2 × 2).Once again, we want to stress that finding an appropriate discretization scheme already requires a thorough analysis of (1).The specific choices listed above represent three of the most promising candidates found during our investigation.As we consider this example to be a prototype for more advanced settings from topology optimization, e.g., switching the setting to the DDA model later, we compare the different approaches with respect to the number of inner gradient evaluations, since this is by far the most time-consuming step in these cases.To be precise, an evaluation represents the calculation of Abs, Sca, Geo, ∇ Abs, ∇ Sca and ∇ Geo for a single (λ, R, d) ∈ Λ × R × D.
Since the produced iterates depend on the initial design, we randomly selected 500 starting points in the whole design domain U = [1, 75]× [1,250].In each optimization run, the total number of evaluations was limited to 50.000 for fmincon and to 5.000 for SCIBL-CSG.To obtain an overview of the general performance of the different approaches, we take snapshots of all iterates after different amounts of evaluations.The results are given in Figure 9 and Figure 10 and yield a good impression on how fast each method tends to find solutions to (1).Note that, for the sake of readability and better comparison, the final CSG iterates after 5.000 evaluations are shown in all graphs labeled with a higher number of total evaluations.By comparing Figure 9 and Figure 10 with Figure 4, we observe that the artificial flat regions discussed earlier indeed slow down the optimization progress for all choices of prediscretization.Furthermore, we note that only the highest resolution 10 × 10 × 10 overcomes this approximation error, at the cost of the largest amount of evaluations needed.In contrast, the resolutions 7 × 7 × 7 and 8 × 2 × 2 converge much faster, but some of the final designs are no stationary points of (1).Out of the 500 optimization runs we performed, 7 × 7 × 7 converged to a wrong design, i.e., artificial local minimum, 16 times (3.2%).For 8 × 2 × 2, a wrong design was found in 218 (43.6%) instances, see Figure 10.
Lastly, we are interested in the performance of each method with respect to J(u n ) over the course of the iterations.Since each local solution to (1) admits a different objective function value, we focus only on the global maximum.For all approaches, we selected all runs whose final designs are closer to the global Fig. 7 Median objective function value of all optimization runs in which the final design was closer to the global maximum of (1) than to any other stationary point.The values were obtained using a discretization into 50 × 50 × 50 points.
Fig. 8 The medians presented in Figure 7 (solid lines) and the corresponding quantiles P 0.25,0.75, indicated by the shaded areas.For better visibility, the number of evaluations is scaled logarithmically and the discretization 8 × 2 × 2 was discarded.maximum of (1) than to any other stationary point.The results are shown in Figure 7 and Figure 8.

Optimization in the DDA Model
As a final example from application, we drop the restriction to core shell particles and consider hematite nanoparticles of arbitrary shape with the DDA model.While the setting is very similar to the setting analyzed above, there are some minor differences.
First, we slightly change the weights appearing in the objective function: This change was made purely for aesthetics, as the weights in (1) favour radially symmetric solutions, while (2) admits local solutions with a more interesting design structure.Furthermore, we do not assume a particle design distribution anymore, since it is unclear, how such a general shape distribution should look like.However, as the particles are no longer radially symmetric, we now have to consider the orientation of the particle with respect to the incoming light ray instead.Therefore, the K and S values explained in the introduction of this setting need to be averaged over all possible orientations, i.e., Abs(u, λ, ν)dν  and Here, S 2 denotes the unit sphere and the particle orientation ν is assumed to be distributed uniformly random over all possible directions.
The design domain is a ball of 300nm diameter, discretized into n 0 = 65752 dipole cells.The design u ∈ [0, 1] n0 gives the relative amount of hematite to water in each cell.The optical properties of intermediate (grey) material u (i) ∈ (0, 1) are generated by linear interpolation between the respective properties of water and hematite.
Generally, one would combine filtering techniques and greyness penalization to obtain a smooth final design without intermediate material (see, e.g., [24]).However, we explicitly refrain from doing so to present a clear analysis of the CSG performance, without interference from secondary layers of smoothing techniques.
As mentioned above, the change to the DDA model significantly increases the computational cost of evaluating Sca, Abs and Geo for a given (u, λ, ν) ∈ U × Λ × S 2 .Thus, the deterministic approaches used in the previous setting are no longer computationally feasible.
Furthermore, we want to use this example to analyze the impact of the chosen norm on U × Λ × S 2 , appearing in the nearest neighbor calculation, which was already mentioned in [2, Section 3.5].To be precise, calculating the CSG integration weights requires the definition of an outer norm where • U , • Λ and • S 2 denote norms on the corresponding inner spaces and c u , c λ , c ν > 0. In this application, we choose the euclidean norm • 2 for each inner space.Additionally, we fix c u = 1, but consider different coefficients c λ and c ν .
For the optimization, we consider three different initial designs, which are shown in Figure 11, top row.The objective function value as well as the values of L, a and b for these designs were computed using the CSG method with fixed design, i.e., with constant step size τ = 0, and verified by Monte Carlo (see, e.g., [25]) integration.For one of the initial designs, the objective function value approximation of CSG and Monte Carlo integration with respect to the number of evaluations and different choices of • Out is shown in Figure 12.
Each design was optimized with SCIBL-CSG, using inexact hybrid weights for the integration over S 2 and exact hybrid weights for the integration over Λ.For • Out , we considered four different choices of the parameters:   The results in case (a) for all three initial designs are presented in Figure 13 and the respective design evolution for the initial design screwdriver (50%), shown in Figure 11 top row, is depicted in Figure 14.The corresponding final designs, obtained after 5.000 SCIBL-CSG iterations, are presented in Figure 11, bottom row.As a second measure for convergence in the design space, the evolution of the norm distance to the respective final designs are shown in Figure 15 for all three initial designs.
Comparing Figure 12 and Figure 13, we notice that CSG, using an appropriate outer norm, finds an optimized design almost as fast as it computes the objective function value for a given design.In other words: The full optimization process is only slightly more expensive that the simple evaluation of a single design.Moreover, CSG finds an optimal solution to (2) long before the Monte Carlo approximation to the initial objective function value is converged.
It should, of course, also be noted, that choosing • Out should be done with caution, as Figure 16 shows.While case (a) is, to the best of our knowledge, not optimal by any means, cases (b) and (c) clearly show worse results.Choosing Fig. 12 Objective function approximation for the screwdriver (50%) design.The blue and orange curve show the results for CSG with fixed step size τ = 0 and different coefficients of the outer norm • Out .For Monte Carlo, each inner integral over S 2 was approximated using 40 random directions.The true objective function value J * ≈ 37.84 is indicated by the dashed line.The Monte Carlo results are truncated for the sake of readability, as it requires over 8.000 evaluations to reach a approximation to J * .• Out extremely poorly, i.e., case (d), can even have devastating effects on the performance, see Figure 17.
This, however, could also imply that the performance might be significantly improved, if problem specific inner and outer norms would be chosen.Especially in even more complex settings, techniques to obtain such norms a priori, or even during the optimization process itself, represent one of the most important points for further research.3

Online Error Estimation
Before we go into theoretical details, we first collect a few key properties and results concerning CSG, which were shown in [2].In a first simple setting, we consider optimization problems of the form min J(u) Additionally, we assume that U is compact, and for some d r ∈ N, there exists an open an bounded set X ⊂ R dr and a measure µ with supp(µ) ⊂ X , such that J can be written as J(u) = X j(u, x)µ(dx).The detailed set of assumptions is given in [2, Section 2].For now, it is only important that ∇ 1 j : U × X → R is bounded and Lipschitz continuous with Lipschitz constant L j .
During the optimization process, CSG computes design dependent integration weights α k k=1...,n (cf.[2, Section 3]) to build an approximation Ĝn to the true objective function gradient, based on the available samples from previous iterations ∇ 1 j(u k , x k ) k=1,...,n .To be precise, we have Carefully investigating the methods to obtain the integration weights, we observe that where ν n denotes the measure associated to one of the measures listed in [2, Section 3.6], depending on the choice of integration weights, and By construction, M k contains all points x ∈ X , such that (u n , x) is closer to (u k , x k ) than to any other previous point we evaluated ∇ 1 j at.For exact The Continuous Stochastic Gradient Method integration weights, we have ν n = µ and thus Here, Z n is given by In other words, the approximation error can be bounded in terms of the Lipschitz constant of ∇ 1 j and the quantity Z n , which relates to the size of Voronoi cells [26] with positive integration weights.
Both L j and sup x∈X Z n (x) can be efficiently approximated during the optimization process, e.g. by finite differences of the samples ∇ 1 j(u i , x i ) i=1,...,n and by sup yielding an online error estimation.Such an approximation may, for example, be used in stopping criteria.

Convergence Rates
Throughout this section, we assume [2, Assumptions 2.2 -2.8] to be satisfied.

Theoretical Background
In the convergence analysis presented in [2], we have already seen that the fashion in which the gradient approximation Ĝn is calculated in CSG is crucial for Ĝn − ∇J(u n ) → 0 and that this property of CSG in turn is the key to all advantages CSG offers in comparison to classic stochastic optimization methods, like convergence for constant steps, backtracking, more involved optimization problems, etc.The price we pay for this feature lies within the dependency of Ĝn on the past iterates.For comparison, the search direction ĜSG n in a stochastic gradient descent method is given by ĜSG Thus, it is independent of all previous steps and fulfills i.e., it is an unbiased sample of the full gradient.The combination of these properties allows for a straight-forward convergence rate analysis, see, e.g., [27].
In contrast, Ĝn is in general not an unbiased approximation to ∇J(u n ) and moreover not independent of u i , x i ) i=1,...,n−1 .The main problem in finding the convergence rate of u n+1 − u n → 0 is, that this quantity depends on the approximation error Ĝn − ∇J(u n ) , which, as we have seen in Section 3, depends on Z n .Since Z n itself is deeply connected to min k u n − u k , we run into a circular argument.
Therefore, up to now, we are not able to proof convergence rates for the CSG iterates.We can, however, state a prediction to this rate and provide numerical evidence.
Claim 4.1 We claim that the CSG method, applied to problem (3), using a constant step size τ < 2 L and empirical integration weights, fulfills To motivate this claim, note that, in the proof of [2, Lemma 4.7], it was shown that there exists where d W denotes the Wasserstein distance of the two measures µ n and µ.By [28, Theorem 1], the empirical measure µ n satisfies This result is the main motivation for Claim 4.1.It can be shown that the rate n −1/dr for d r ≥ 3 is sharp if µ corresponds to a uniform distribution on X .Thus, in this case, it is reasonable to assume a uniform distribution also corresponds to the worst-case rate of X Z n (s)µ(dx) → 0. Assuming that the The Continuous Stochastic Gradient Method difference in designs appearing in Z n is negligible due to the overall convergence of CSG, we obtain the rate To see this, we fill X ⊂ R dr with balls (w.r.t. the norm • X ) of radius ε > 0 and denote by N (ε) ∈ N the number of cells.Due to the dimension of X , we have O N (ε) = ε −dr .Now, to achieve sup x∈X Z n (x) < ε, we need each of these cells to contain at least one of the sample points (x i ) i=1,...,n .It is well-known that the expected number of samples we need to draw for this to happen is given by where we used In other words, the convergence rates of X Z n (x)µ(dx) → 0 and d W (µ n , µ) → 0 are comparable.Now that we motivated the rates claimed in Claim 4.1 for the approximation error Ĝn − ∇J(u n ) , we use the following proposition to show that the rates of u n+1 − u n → 0 can not be worse.
Proof Assume for contradiction that this is not the case.Thus, there exists N ∈ N such that By the descent lemma [29,Lemma 5.7], the characteristic property of the projection operator [29,Theorem 6.41] and the Cauchy-Schwarz inequality, we obtain Combining this with (4) gives J(u n+1 ) ≤ J(un) for all n ≥ N , since L 2 < 1 τ .Thus, the sequence of objective function values J(un) n∈N is monotonically decreasing for all n ≥ N .By continuity of J and compactness of U, J is bounded and J(un) → J for some J ∈ R. Therefore, Hence, the series

Numerical Verification
We want to verify the proclaimed rates numerically.For this purpose, we consider two optimization problems that can easily be scaled to high dimensions.
The first problem is given by where The optimal solution to (5) and ( 6) is given by the zero vector u * = 0 ∈ U.
In our analysis, for different values of the dimensions d r , d o ∈ N, problems (5) and ( 6) were initialized with 500 random starting points.The constant step size of CSG was chosen as τ = 1 2 .We track u n − u * and max k=1,...,n Z n (x k ) during the optimization process and compare the median of the 500 runs to the rates predicted in Claim 4.1.The results can be seen in Figures 18 to 21.Note that, for the plots of the predicted rates, we omitted the factor ln(n).Therefore, The Continuous Stochastic Gradient Method the corresponding graphs are straight lines, where the slope − 1 max{2,dr} is equal to the asymptotic slope of the predicted rate, since for all ε > 0.
In the equidimensional, i.e., dim(X ) = dim(U), setting (5), the experimentally obtained values for Z n almost perfectly match the claimed rates.For u n − u * , the observed rates also match the predictions for very small and large dimensions.For d r = 3, 4, 5, the convergence obtained in the experiments was even slightly faster than predicted.Investigating the results for (6), it is clearly visible that increasing the design dimension d o , while keeping the parameter dimension d r fixed, has no influence on the obtained rates of convergence, indicating that CSG is able to efficiently handle large-scale optimization problems.

Circumventing Slow Convergence
As we have seen so far, the convergence rate of the CSG method worsens with increasing dimension of integration d r ∈ N.However, it is possible to circumvent this behavior, if the problem admits additional structure.Assume that there exist suitable X 1 , X 2 , µ 1 , µ 2 , f 1 and f 2 such that the objective function appearing in (3) can be rewritten as Assume further, that X 1 , X 2 , µ 1 , µ 2 , f 1 and f 2 satisfy the corresponding equivalents of [2, Assumptions 2.2 -2.8].Now, we can independently calculate integration weights (α k ) k=1,...,n and (β k ) k=1,...,n for the integrals over X 1 and X 2 , respectively.The corresponding CSG approximations (indicated by hats) are then given by The same steps as performed in the proof of [2,Lemma 4.7] yield the existence of a constant C 1 > 0, depending only on the Lipschitz constants of ∇f 1 and Here, ν β n corresponds to the measure related to the integration weights (β k ) k=1,...,n , see [2,Assumption 2.8].Now, denoting by C 2 > 0 a constant depending on the Lipschitz constant L f2 of f 2 , we decompose the last term: Assuming that the convergence of the sequence (u n ) n∈N generated by the CSG method implies we insert (8) into (7), to obtain In conclusion, we claim that, assuming the objective function can be rewritten in terms of nested expectation values the convergence rate of the CSG method depends only on the largest dimension of the occurring X i , which may be much lower when compared to dim(X ).Since this is again a claim and not a rigorous proof, we validate this assumption numerically.For this, we once more consider (5) and initialize it with 500 random starting points.This time, however, we utilize the fact that the objective function can be written as Thus, we can group the independent coordinates into subintegrals of arbitrary dimension, allowing us to study our claim for a large number of different regroupings without having to change the whole problem formulation.The results for several different decompositions and 500 random starting points in the case d r = 100 are shown in Figure 22.The improved rates of convergence are clearly visible, independent on whether the subgroup dimensions are equal or not.As claimed above, the highest remaining dimension of integration determines the overall convergence rate of CSG.

Conclusion and Outlook
In this contribution, we presented a numerical analysis of the CSG method.The practical performance of CSG was tested for two applications from nanoparticle design optimization with varying computational complexity.
For the low-dimensional problem formulation, CSG was shown to perform superior when compared to the commercial fmincon blackbox solver.The high-dimensional setting provided an example, for which classic optimization schemes (stochastic as well as deterministic) from literature do not provide optimal solutions within reasonable time.The Continuous Stochastic Gradient Method Convergence rates for CSG with constant step size were proposed and analytically motivated.They were shown to agree with numerically obtained convergence rates in several different instances.Moreover, in the case that the objective function admits additional structure, techniques to circumvent slow convergence for high dimensional integration domains were presented.
While the proposed convergence rates for CSG agree with our experimental results, it remains an open question if they can be proven rigorously.Furthermore, even though the choice of a metric for the nearest neighbor approximation in the integration weights is irrelevant for the convergence results, a problem specific metric could significantly improve the performance of CSG by exploiting additional structure, which might be lost by utilizing an arbitrary metric.How to automatically obtain such a metric during the optimization process requires further research.
Data Availability Statement.The simulation datasets generated during the current study are available from the corresponding author on reasonable request.

Fig. 1
Fig. 1 Resulting color for various different values of a and b.Positive values of a result in red colors, while colors corresponding to negative values of a appear green.Similarly, positive b values yield yellow colors, while negative b values shift the color into the blue spectrum.In this figure, we fixed L = 50.

Fig. 2
Fig. 2 Radially symmetric core-shell nanoparticle.The inner core (blue) has radius R in the range of 1nm -75nm and consists of water.The thickness of the hematite shell (red) is denoted by d and ranges from 1nm to 250nm.

Fig. 3
Fig.3Objective function values for fixed core radius of 3nm.Different graphs correspond to different discretizations.The label of a curve shows into how many points the integrals over Λ, R and D have been split, respectively.Each of the discretizations introduces artificial stationary points into the objective function.

Fig. 4
Fig.4Flat regions in the discretizted objective functions.The underlying contour plot corresponds to the discretization of Λ × R × D into 50 × 50 × 50 points.For each figure, the green region consists of all points at which the euclidean norm of the gradient of the discretized objective function is smaller than 0.05.The discretizations of Λ × R × D are given in the titles, respectively.

Fig.
Fig.5Cumulative density function for R in the case R = 80.The six integration points (red dots) are obtained by dividing (0, 1) in six intervals of equal size and calculating the midpoints of the resulting preimages (black crosses).Note that the preimages are first projected on the 3σ-interval.

Fig. 9
Fig. 9 Iterates of the different optimization approaches for (1) in the whole design domain U = [1, 75] × [1, 250].For fmincon, the discretization of Λ × R × D is given in the titles, respectively.To measure the progress, the starting points are also shown.As mentioned above, an evaluation corresponds to the calculation of Abs, Sca, Geo, ∇ Abs, ∇ Sca and ∇ Geo for one combination (λ, R, d) ∈ Λ×R×D.Again, the underlying contours are obtained by discretizing Λ × R × R into 50 × 50 × 50 points.

Fig. 10
Fig.10Continuation of the results for (1) presented in Figure9.Since CSG was stopped after 5.000 evaluations, the iterates do not change afterwards, but are still shown as a point of reference.In the last row, final designs obtained by 7 × 7 × 7 and 8 × 2 × 2, which do not correspond to stationary points of (1), are highlighted in blue.

Fig. 11
Fig.11Representation of the initial designs (top row).Red boxes correspond to cells consisting purely of hematite, while grey boxes indicate an artificial intermediate material, consisting of 50% hematite and 50% water.For later references, we denote the initial designs by plate (100%), plate (50%) and screwdriver (50%), respectively.The different final designs, obtained by 5.000 iterations of SCIBL-CSG with outer norm (a) are shown in the bottom row.For better visibility, cells with less than 50% hematite are considered as pure water and left out of the visualization.For each final design, the amount of cells discarded in this fashion is less than 100 (less than 0.15% of all cells).

Fig. 13
Fig. 13 CSG objective function approximations during the optimization process for all initial designs and choice (a) for • Out , i.e., cu = 1, c λ = 100 and cν = 100.The dashed lines indicate the objective function values of each initial design, respectively.

Fig. 14
Fig.14Top left to bottom right: Design evolution during the optimization process for the screwdriver (50%) initial design and outer norm (a).The design snapshots were taken every 200 iterations.Red boxes represent design cells consisting of pure hematite.Intermediate material is indicated via a color gradient, where a cell filled with 50% water and 50% hematite is colored grey.Based on this gradient, depending on the ratio of hematite and water in a cell, the cell color is shifted to red (more hematite) or blue (more water).

Fig. 15
Fig. 15 Euclidean distance (after dividing by dim(U ) for scaling) between intermediate designs and the respective final design during the SCIBL-CSG optimization process, carried out with outer norm (a).

Fig. 16
Fig. 16 CSG objective function value approximation during the optimization process for the plate (100%) initial design.The dashed line shows the inital objective function value, whereas the different graphs correspond to the choices (a), (b) and (c) for • Out .

Fig. 17
Fig.17Results for the plate (100%) initial design presented in Figure16, augmented by the CSG objective function value approximation in the case that • Out was chosen according to (d).
Fig.17Results for the plate (100%) initial design presented in Figure16, augmented by the CSG objective function value approximation in the case that • Out was chosen according to (d).

Fig. 18 1 max{2
Fig. 18 The bold lines represent the median values of max k=1,...,n Zn(x k ) for the equidistant problem (5) with respect to the iteration counter.The different colors indicate the different dimensions dr ∈ {1, 2, . . ., 500}.The dotted lines correspond to the respective predicted rates n − 1 max{2,dr} .Since the predictions for dr = 1 and dr = 2 are equal, only the case dr = 2 is shown.

Fig. 19
Fig. 19 Median values of un − u * in the equidimensional setting (5) for different choices of dr ∈ {1, 2, . . ., 500}.For each dimension, the predicted worst-case asymptotic line n − 1 max{2,dr} is indicated by the dotted line.Again, we omit the prediction for dr = 1, since it has the same slope as in the case for dr = 1.

Fig. 20
Fig.20Results for the median of max k=1,...,n Zn(x k ) in setting(6) for different dimensions do ∈ {1, 2, . . ., 1000}, indicated by different colors.As we conjectured, the asymptotic slope of all curves is equal, since dr = 1 is fixed.As a point of reference, we added the graph of n −0.65 , represented by the dotted line.

Fig. 21
Fig.21Median distance to the optimal solution u * during the course of the iterations for do ∈ {1, 2, . . ., 1000}.Again, the asymptotic slope of all curves is equal and we added the line corresponding to n −0.65 for comparison.

Fig. 22 1 2
Fig. 22 Median total error un − u * of the CSG iterates for (5), for dr = 100.The integral over X = − 1 1 , 1 2 dr has been decomposed into several integrals of smaller dimension.The labels in the bottom left give details about the decomposition, e.g., the orange line corresponds to splitting the whole integral into one integral of dimension 75 and 5 integrals of dimension 5.The dotted line indicates the expected rate of convergence obtained by the CSG method without splitting up the integral.