1 Introduction

In this paper, we present a numerical analysis of the Continuous Stochastic Gradient (CSG) method, which was first proposed in [1]. Later, in [2], it was shown that the error in the CSG gradient and objective function approximation vanishes during the course of the iterations. This key property of CSG yields strong convergence results known from classic gradient methods, e.g., convergence of the sequence of iterates for constant step sizes, which are beyond the scope of standard stochastic approaches known from the literature, like the Stochastic Gradient (SG) method [3] or the Stochastic Average Gradient (SAG) method [4].

Furthermore, the approximation property of CSG significantly increases the set of possible applications, allowing for more complex structures in the optimization problem than the schemes listed before. While CSG was shown to perform better than various stochastic optimization approaches on academic examples [2], it remains to be seen whether this is also the case for more involved applications. For this purpose, we consider several optimization problems arising in the context of optimal nanoparticle design. These applications focus on optimization with respect to the resulting color of a particulate product, as it represents one of the most prominent fields of research within this setting [5,6,7,8,9,10].

Moreover, all convergence results stated in [2] provide no insight into the rate of convergence. Since this rate plays a crucial role for the practicability of CSG, it is of great importance to analyze it further. In this contribution, we conjecture estimated convergence rates for the general CSG method and verify them numerically.

1.1 Structure of the paper

Section 2 introduces the application from nanoparticle optics, mentioned above. Two different methods to model the particle, varying greatly in computational effort and design dimension, are presented. After detailing the setting and challenges in the low-dimensional optimization problem, we compare the results of the CSG method to different approaches based on the fmincon algorithm provided by MATLAB (Sect. 2.7). Later on, we analyze the high-dimensional problem formulation purely within the CSG framework, since a comparison with generic deterministic optimization schemes is out of scope, due to the associated computational complexity.

Afterwards, Sect. 3 briefly covers techniques to estimate the gradient approximation error during the optimization, before we focus on the convergence rate of CSG in Sect. 4. While the expected rates stated therein are not proven, we present detailed numerical examples to solidify our claims. Furthermore, we analyze how the convergence rate depends on the dimension of integration and how to avoid slow convergence if the objective function admits additional structure.

2 Nanoparticle design optimization

Since the design of a nanoparticle, i.e., its shape, size, material distribution, etc., heavily impacts its optical properties, the task of optimizing a nanoparticle design with respect to a specific optical property arises naturally [11]. In this section, we are interested in using hematite nanoparticles to optimize the color of a paint film [12]. Thus, we start by introducing our main framework for this application.

2.1 Color spaces

First off, we should explain what optimal color means in our setting. There are several different methods to describe color mathematically, e.g., assigning each color an RGB representation vector \(\textbf{v}\in \mathbb {R}^3\), where the three components of \(\textbf{v}\) correspond to the red, green and blue value of the color. In our application, we are interested in the color of the paint film as it appears to the human eye. Therefore, the underlying color space should be chosen based on the following property:

If the Euclidean distance between the representation vectors of two colors is small, the colors should be almost indistinguishable to the human eye.

As it turns out, the RGB color space is a very poor choice with respect to this feature. Hence, we instead choose the CIELAB color space [13], which was introduced by the International Commission on Illumination (Commission Internationale de l’Eclairage, CIE), as it was designed with this exact purpose in mind. The CIELAB representation of a color consists of three values \(\textbf{L}\), \(\textbf{a}\) and \(\textbf{b}\). Here, \(\textbf{L}\) corresponds to the lightness of a color and ranges from 0 (black) to 100 (white). The values of \(\textbf{a}\) and \(\textbf{b}\), typically within the range of \(\pm 150\), describe the color's position with respect to the opponent color pairs green-red and blue-yellow. A short overview is given in Fig. 1.

Another color space, which naturally arises from our setting, is the CIE 1931 XYZ color space [14]. The values of X, Y and Z can be calculated by integrating the optical properties of a particle over the spectrum of visible light (400–700 nm), which we denote by \(\Lambda \). Each of these integrations is weighted by the corresponding color matching functions \(x,y,z:\Lambda \rightarrow \mathbb {R}\).

Thus, in our application, we will first calculate the CIE 1931 XYZ representation of the resulting color and then use the (nonlinear) color space transformation \(\Psi :\mathbb {R}^3\rightarrow \mathbb {R}^3\) with \(\Psi (\text {X,Y,Z}) = (\textbf{L},\textbf{a},\textbf{b})^\top \), to work in the CIELAB color space. For this transformation, we define a reference white point

$$\begin{aligned} \begin{pmatrix} \text {X}_r \\ \text {Y}_r \\ \text {Z}_r \end{pmatrix} = \begin{pmatrix} 94.72528492 \\ 100 \\ 107.13012997\end{pmatrix} \end{aligned}$$

and denote the relative XYZ values by

$$\begin{aligned} \tilde{\text {X}} = \tfrac{\text {X}}{\text {X}_r}, \quad \tilde{\text {Y}} = \tfrac{\text {Y}}{\text {Y}_r}, \quad \text {and}\quad \tilde{\text {Z}} = \tfrac{\text {Z}}{\text {Z}_r}. \end{aligned}$$

Utilizing the intended CIE parameters \(\epsilon = \tfrac{216}{24389}\) and \(\kappa = \tfrac{24389}{27}\), the LAB color values are then given by

$$\begin{aligned} \textbf{L}= 116f(\tilde{\text {Y}}) -16,\quad \textbf{a}= 500\big (f(\tilde{\text {X}})-f(\tilde{\text {Y}})\big )\quad \text {and}\quad \textbf{b}= 200\big (f(\tilde{\text {Y}})-f(\tilde{\text {Z}})\big ), \end{aligned}$$

where \(f:\mathbb {R}\rightarrow \mathbb {R}\) is defined as

$$\begin{aligned} f(t) = {\left\{ \begin{array}{ll} \root 3 \of {t} &{} \text {if }t > \epsilon \\ \tfrac{\kappa t + 16}{116} &{} \text {otherwise} \end{array}\right. }. \end{aligned}$$
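For reference, the complete transformation \(\Psi \) can be written in a few lines. The following Python sketch assumes the reference white point given above; the function names are chosen for illustration only.

```python
import numpy as np

# Reference white point (X_r, Y_r, Z_r) from the text.
WHITE = np.array([94.72528492, 100.0, 107.13012997])
EPS = 216.0 / 24389.0
KAPPA = 24389.0 / 27.0

def f(t):
    """Piecewise cube-root function used in the CIELAB transformation."""
    t = np.asarray(t, dtype=float)
    return np.where(t > EPS, np.cbrt(t), (KAPPA * t + 16.0) / 116.0)

def xyz_to_lab(xyz):
    """Color space transformation Psi: CIE 1931 XYZ -> CIELAB (L, a, b)."""
    fx, fy, fz = f(np.asarray(xyz, dtype=float) / WHITE)  # relative XYZ values
    L = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b = 200.0 * (fy - fz)
    return np.array([L, a, b])

print(xyz_to_lab(WHITE))  # the white point maps to (100, 0, 0)
```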
Fig. 1

Resulting color for various different values of \(\textbf{a}\) and \(\textbf{b}\). Positive values of \(\textbf{a}\) result in red colors, while colors corresponding to negative values of \(\textbf{a}\) appear green. Similarly, positive \(\textbf{b}\) values yield yellow colors, while negative \(\textbf{b}\) values shift the color into the blue spectrum. In this figure, we fixed \(\textbf{L}= 50\)

2.2 Mie theory and discrete dipole approximation

Given a nanoparticle shape and material, we can use the time-harmonic Maxwell’s equations to calculate its optical properties. Specifically, in our setting, we are interested in the absorption (\({{\,\textrm{Abs}\,}}\)), scattering (\({{\,\textrm{Sca}\,}}\)) and geometry factor (\({{\,\textrm{Geo}\,}}\)) [15, Section 2.8]. These properties describe the interactions of a particle with light and are therefore dependent not only on the particle’s design, but also its orientation w.r.t. the incoming lightwave as well as the wavelength of said light. The time required and precision achieved in their numerical calculation are, of course, dependent on our model of the nanoparticle and the method used to solve Maxwell’s equations. For our setting, we choose two different approaches.

On the one hand, we will use the discrete dipole approximation (DDA) [16,17,18], in which the particle is discretized into an equidistant grid of dipole cells. Thus, DDA allows the analysis of arbitrary particle shapes and material distributions. The downside lies within the computational complexity of the method, which scales with the total number of dipoles and therefore grows rapidly when increasing the resolution. While the CSG method is still capable of solving the resulting optimization problem in our experiments, the tremendous computational cost associated with the DDA approach severely impedes a detailed analysis of the problem. In particular, there is no computationally feasible, generic optimization scheme to compare our results with. However, we want to note that optimization in the DDA model has already been done in a slightly simpler setting, where the full integral over \(\Lambda \) was replaced by summation over a small number of different wavelengths [19].

On the other hand, Mie theory [20, 21] provides a numerically cheap alternative, at the price of a more restrictive setting. In Mie theory, one only considers radially symmetric particles. In this special setting, it is possible to find analytic solutions based on series expansions to the time-harmonic Maxwell’s equations. Therefore, in our first approach, we will only consider core-shell particles, as the utilization of Mie theory allows for a much deeper analysis of the resulting optimization problem and comparison to deterministic optimization approaches, which rely on discretization of the integrals.

2.3 Nanoparticles in paint film—Kubelka–Munk theory

As mentioned above, the XYZ color values of the paint film can be calculated by integrating the corresponding color matching functions x, y and z together with the relevant optical properties of the nanoparticle. The precise method to obtain X, Y and Z is given by the Kubelka–Munk theory [22], augmented by a Saunderson correction [23]. For a paint film in which nanoparticles of design u are oriented in direction \(\nu \in \mathbb {S}^2\) and which is illuminated by light of wavelength \(\lambda \in \Lambda \), the resulting color can be expressed by the K and S values

$$\begin{aligned} K(u,\lambda ,\nu ) = {{\,\textrm{Abs}\,}}(u,\lambda ,\nu )\quad \text {and}\quad S(u,\lambda ,\nu ) = {{\,\textrm{Sca}\,}}(u,\lambda ,\nu )\big (1-{{\,\textrm{Geo}\,}}(u,\lambda ,\nu )\big ) \end{aligned}$$

via the reflectance

$$\begin{aligned} R_\infty (u,\lambda ,\nu ) = 1 + \frac{8}{3}\frac{K(u,\lambda ,\nu )}{S(u,\lambda ,\nu )} - \sqrt{\left( \frac{8}{3}\frac{K(u,\lambda ,\nu )}{S(u,\lambda ,\nu )}\right) ^2 + \frac{16}{3}\frac{K(u,\lambda ,\nu )}{S(u,\lambda ,\nu )}}\,. \end{aligned}$$

Now, X, Y and Z can be obtained by

$$\begin{aligned} \text {X}(u,\nu )&= \int _\Lambda x(\lambda )\frac{(1-\rho _0-\rho _1)R_\infty (u,\lambda ,\nu )+\rho _0}{1-\rho _1 R_\infty (u,\lambda ,\nu )}\,\textrm{d}\lambda , \\ \text {Y}(u,\nu )&= \int _\Lambda y(\lambda )\frac{(1-\rho _0-\rho _1)R_\infty (u,\lambda ,\nu )+\rho _0}{1-\rho _1 R_\infty (u,\lambda ,\nu )}\,\textrm{d}\lambda , \\ \text {Z}(u,\nu )&= \int _\Lambda z(\lambda )\frac{(1-\rho _0-\rho _1)R_\infty (u,\lambda ,\nu )+\rho _0}{1-\rho _1 R_\infty (u,\lambda ,\nu )}\,\textrm{d}\lambda , \end{aligned}$$

where \(\rho _0\) and \(\rho _1\) are material parameters. In our setting, which we introduce in the next section, we have \(\rho _0 = 0.04\) and \(\rho _1 = 0.6\). Moreover, x, y and z are the color matching functions, as given in [24].
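A minimal sketch of this evaluation chain is given below: the Kubelka–Munk reflectance, the Saunderson correction and a simple quadrature over \(\Lambda \). The callables `absorption`, `scattering` and `geometry` as well as the tabulated color matching functions are assumptions for illustration; in our experiments, the integral over \(\Lambda \) is of course treated by the integration weights of CSG rather than by a fixed quadrature.

```python
import numpy as np

RHO0, RHO1 = 0.04, 0.6  # Saunderson correction parameters from the text

def reflectance(K, S):
    """Kubelka-Munk reflectance R_infinity for given K and S values."""
    q = (8.0 / 3.0) * (K / S)
    return 1.0 + q - np.sqrt(q**2 + 2.0 * q)

def saunderson(R_inf):
    """Saunderson-corrected reflectance entering the XYZ integrals."""
    return ((1.0 - RHO0 - RHO1) * R_inf + RHO0) / (1.0 - RHO1 * R_inf)

def xyz_values(u, lambdas, cmf_x, cmf_y, cmf_z, absorption, scattering, geometry):
    """Approximate X, Y and Z for a design u by a simple quadrature over Lambda.

    absorption, scattering and geometry are assumed callables returning Abs,
    Sca and Geo for (u, lambda); cmf_x, cmf_y, cmf_z are the color matching
    functions tabulated on the equidistant wavelength grid lambdas.
    """
    K = np.array([absorption(u, lam) for lam in lambdas])
    S = np.array([scattering(u, lam) * (1.0 - geometry(u, lam)) for lam in lambdas])
    R = saunderson(reflectance(K, S))
    dlam = lambdas[1] - lambdas[0]
    return (np.sum(cmf_x * R) * dlam,
            np.sum(cmf_y * R) * dlam,
            np.sum(cmf_z * R) * dlam)
```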

2.4 Problem formulation

In our first setting, we consider a radially symmetric core-shell nanoparticle (see Fig. 2), where the inner core consists of water, while the outer shell is made of hematite. Thus, the design u consists of the radius R (1–75 nm) of the core and the thickness d (1–250 nm) of the outer hematite shell, i.e., we have \(u=(R,d)\in \mathcal {U}= [1,75]\times [1,250]\). Due to the symmetry of the particle, its optical properties do not depend on the orientation \(\nu \in \mathbb {S}^2\), which is why we omit it in our further analysis of this setting.

Fig. 2

Radially symmetric core-shell nanoparticle. The inner core (blue) has radius R in the range of 1–75 nm and consists of water. The thickness of the hematite shell (red) is denoted by d and ranges from 1 to 250 nm

As an additional layer of difficulty, we can, in practice, not expect all nanoparticles present in the paint film to be identical copies of design u. Instead, when trying to produce nanoparticles of a specific design in large quantities, one usually ends up with a mixture of particles of different designs, following a certain probability distribution \(\mu _u\), which is dependent on the intended design u.

We model this aspect by assuming that, given a design \(u=(R,d)\), the particles present in the paint film follow a truncated normal distribution on the space of reasonable designs \({\mathcal {R}\times \mathcal {D}=[10^{-4},150]\times [10^{-4},500]}\) centered around u, i.e.,

$$\begin{aligned} {\tilde{R}}\sim \mathcal {N}_{_\mathcal {R}}(R,\tfrac{1}{10}R)\quad \text {and}\quad {\tilde{d}} \sim \mathcal {N}_{_\mathcal {D}}(d,\tfrac{1}{10}d). \end{aligned}$$

Truncating the normal distribution to the space \(\mathcal {R}\times \mathcal {D}\) circumvents nonphysical particles appearing in the design distributions, like designs with negative components. From a numerical point of view, the impact is negligible, as the combined weight of all excluded designs is below typical machine precision, since a design component must deviate from the average by more than 9 standard deviations in order to be rejected. As the paint film no longer consists of identical particles, the K and S values in the Kubelka–Munk model need to be replaced by their averaged counterparts

$$\begin{aligned} K(u,\lambda )&= \iint _{\mathcal {R}\times \mathcal {D}}{{\,\textrm{Abs}\,}}({\tilde{R}},{\tilde{d}},\lambda )\textrm{d}\mu _u({\tilde{R}},{\tilde{d}}) \end{aligned}$$

and

$$\begin{aligned} S(u,\lambda )&= \iint _{\mathcal {R}\times \mathcal {D}} {{\,\textrm{Sca}\,}}({\tilde{R}},{\tilde{d}},\lambda )\big (1-{{\,\textrm{Geo}\,}}({\tilde{R}},{\tilde{d}},\lambda )\big )\textrm{d}\mu _u({\tilde{R}},{\tilde{d}}), \end{aligned}$$

before calculating the reflectance \(R_\infty (u,\lambda )\) and integrating it over \(\Lambda \).

The objective in our application is to produce a paint of bright red color. Thus, the complete optimization problem reads

$$\begin{aligned} \max _{u\in \mathcal {U}}\quad \tfrac{1}{20}\,\textbf{L}(u) + \tfrac{19}{20}\,\textbf{a}(u). \end{aligned}$$
(1)

Due to the compactness of \(\mathcal {U}\), \(\mathcal {R}\) and \(\mathcal {D}\), [2, Assumption 2.2] is obviously satisfied. Furthermore, the mapping from a design u, wavelength \(\lambda \) and orientation \(\nu \) to the optical properties \({{\,\textrm{Abs}\,}}\), \({{\,\textrm{Sca}\,}}\) and \({{\,\textrm{Geo}\,}}\) is smooth [25, Eqs. 1a, 1b, 1c]. Since every admissible design has a hematite shell of positive thickness, we obtain a lower bound on \({{\,\textrm{Abs}\,}}\) and \({{\,\textrm{Sca}\,}}\). By definition, the geometry factor is always smaller than 1 in absolute value. Consequently, \(R_\infty \) depends smoothly on \({{\,\textrm{Abs}\,}}\), \({{\,\textrm{Sca}\,}}\) and \({{\,\textrm{Geo}\,}}\). Now, by construction, \(R_\infty \) admits values in [0, 1] only. The color matching functions x, y, z are given pointwise and can thus be interpolated with Lipschitz continuous derivative. As a result, \(\textrm{X}\), \(\textrm{Y}\), \(\textrm{Z}\) are L-smooth functions w.r.t. all arguments. Finally, the function f, appearing in the definition of the color transformation mapping \(\Psi \), is constructed in an L-smooth fashion as well, showing that [2, Assumption 2.3] is satisfied for our setting. By choosing the integration weights presented in [2, Section 3], we can also satisfy [2, Assumption 2.4].

2.5 Challenges

The highly condensed fashion, in which (1) is formulated, may obscure a lot of the difficulties that arise when trying to solve it. To get a better understanding of the problem, let us first analyze the abstract structure of the objective function \(J(u) = \tfrac{1}{20}\,\textbf{L}(u) + \tfrac{19}{20}\,\textbf{a}(u)\):

$$\begin{aligned} \begin{pmatrix} {{\,\textrm{Abs}\,}}\\ {{\,\textrm{Sca}\,}}\\ {{\,\textrm{Geo}\,}}\end{pmatrix} \xrightarrow {\begin{array}{c} \text {integrate} \\ \mathcal {R}\times \mathcal {D} \end{array} } \begin{pmatrix} K\\ S\end{pmatrix} \xrightarrow {\begin{array}{c} \text {Kubelka-} \\ \text {Munk} \end{array} } R_\infty \xrightarrow {\begin{array}{c} \text {integrate}\\ \Lambda \end{array}} \begin{pmatrix} \text {X} \\ \text {Y} \\ \text {Z}\end{pmatrix} \xrightarrow {\begin{array}{c} \text {color} \\ \text {transf.}\Psi \end{array} } \begin{pmatrix} \textbf{L}\\ \textbf{a}\\ \textbf{b}\end{pmatrix}\xrightarrow []{}J(u). \end{aligned}$$

Since calculating J(u) and \(\nabla J(u)\) requires integrating the optical properties in multiple dimensions and since evaluating said properties for any combination of \({\tilde{R}}\), \({\tilde{d}}\) and \(\lambda \) requires solving the time-harmonic Maxwell’s equations, standard deterministic approaches, e.g., full gradient methods, run into a prediscretization problem.

On the one hand, the number of integration points needs to be sufficiently large for our setting. In Fig. 3, a slice through the objective function for a fixed value of R and several different numbers of integration points is shown. While we actually do not care too much about the approximation error resulting from a small number of integration points, the artificial local maxima introduced into the objective function by the discretization severely impact the quality of the optimization. In other words, many solutions to the discretized problem are completely unrelated to solutions to (1). We want to note that, even though not all of the stationary points in Fig. 3 correspond to stationary points of (1), the prediscretization still leads to very flat regions in the objective functions, which hinder the performance of many solvers. This effect is displayed in Fig. 4.

On the other hand, the number of integration points is heavily restricted by the computational cost associated to the evaluation of \({{\,\textrm{Abs}\,}}\), \({{\,\textrm{Sca}\,}}\) and \({{\,\textrm{Geo}\,}}\). While medium resolutions (\(25^3\sim 15000\) points in total) are still numerically tractable for simple Mie particles, they are outright impossible to achieve in the more general DDA setting, which we want to consider later. For comparison: The optimization in [19] was carried out using a discretization consisting of 20 points in total.

We want to emphasize that standard SG-type schemes, or even the Stochastic Composition Gradient Descent (SCGD) method [26], which was used for the comparison for composite objective functions in [2, Section 7.2], are not capable of solving (1). The reason for this lies in the special structure of J, which consists of several integrals nested in nonlinear functions.

Fig. 3

Objective function values for fixed core radius of 3 nm. Different graphs correspond to different discretizations. The label of a curve shows into how many points the integrals over \(\Lambda \), \(\mathcal {R}\) and \(\mathcal {D}\) have been split, respectively. Each of the discretizations introduces artificial stationary points into the objective function

Fig. 4

Flat regions in the discretized objective functions. The underlying contour plot corresponds to the discretization of \(\Lambda \times \mathcal {R}\times \mathcal {D}\) into \(50\times 50\times 50\) points. For each figure, the green region consists of all points at which the Euclidean norm of the gradient of the discretized objective function is smaller than 0.05. The discretizations of \(\Lambda \times \mathcal {R}\times \mathcal {D}\) are given in the titles, respectively

2.6 Discretization

For the reasons mentioned above, we will only compare the results obtained by CSG to generic deterministic optimization schemes for various choices of discretization. Since the integration over \(\Lambda \) admits no special structure, we always choose an equidistant partition for this dimension of integration. However, for the integration over \(\mathcal {R}\times \mathcal {D}\), we can use our knowledge of \(\mu _u\) to achieve a better approximation to the true integral. Instead of dividing \(\mathcal {R}\times \mathcal {D}\) into an equidistant grid, we utilize the fact that \({\tilde{R}}\) and \({\tilde{d}}\) follow truncated one-dimensional normal distributions with parameters independent of each other. Since, for a normal distribution, \(99.7\%\) of all weight is concentrated in the \(3\sigma \)-interval around the mean value, we discretize only this portion of the full domain in each step.

Moreover, we know the precise density function for both \({\tilde{R}}\) and \({\tilde{d}}\). Thus, given a design \(u_n=(R_n,d_n)\), we will partition \(\left( R_n - \tfrac{3}{10}R_n, R_n + \tfrac{3}{10}R_n\right) \) and \(\left( d_n - \tfrac{3}{10}d_n, d_n + \tfrac{3}{10}d_n\right) \) not into equidistant intervals, but instead in intervals of equal weight. This procedure is illustrated in Figs. 5 and 6 and produces very good results even for a small number of sample points.
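A minimal sketch of this construction is given below (Python, using the inverse CDF of the normal distribution from SciPy); the function name is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def equal_weight_nodes(mean, n_points):
    """Quadrature nodes of equal probability weight for N(mean, mean/10),
    restricted to the 3-sigma interval around the mean (cf. Figs. 5 and 6)."""
    sigma = mean / 10.0
    lo, hi = mean - 3.0 * sigma, mean + 3.0 * sigma
    # Divide (0, 1) into n_points intervals of equal size ...
    edges = np.linspace(0.0, 1.0, n_points + 1)
    # ... map the edges back through the inverse CDF, projected onto the 3-sigma window ...
    x_edges = np.clip(norm.ppf(edges, loc=mean, scale=sigma), lo, hi)
    # ... and take the midpoints of the resulting preimage intervals as integration points.
    return 0.5 * (x_edges[:-1] + x_edges[1:])

print(equal_weight_nodes(mean=80.0, n_points=6))  # six nodes for R = 80, as in Fig. 5
```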

However, as we have already seen in Fig. 3, even this dedicated discretization scheme introduces additional problems into (1). Furthermore, we want to emphasize that choosing a reasonable discretization is a challenge of its own. Not only is there no a priori indication of the general magnitude of the number of points needed, it is also unclear whether or not one should use the same number of points in each direction.

Fig. 5

Cumulative distribution function for \({\tilde{R}}\) in the case \(R=80\). The six integration points (red dots) are obtained by dividing (0, 1) into six intervals of equal size and calculating the midpoints of the resulting preimages (black crosses). Note that the preimages are first projected onto the \(3\sigma \)-interval

Fig. 6

Density function for \({\tilde{R}}\) in the case \(R=80\). The red dots represent the six integration points as detailed in Fig. 5. By their special construction, each shaded region under the curve is of equal area

2.7 Numerical results

As mentioned above, the restriction to radially symmetric nanoparticles allows us to apply standard blackbox solvers to (1), in order to have a comparison for the CSG results. In our case, we chose the fmincon implementation of an interior point algorithm, integrated in MATLAB, as it is an easy-to-use blackbox algorithm that yields reproducible results.

Specifically, we compared the results of SCIBL-CSG with empirical weights on \(\mathcal {R}\times \mathcal {D}\) and exact hybrid weights on \(\Lambda \) (cf. [2, Section 3]) to the fmincon results for three different discretization schemes of \(\Lambda \times \mathcal {R}\times \mathcal {D}\). Two of these are equal in each dimension (\(10\times 10\times 10\) and \(7\times 7\times 7\)), while the last one is asymmetric (\(8\times 2\times 2\)). Once again, we want to stress that finding an appropriate discretization scheme already requires a thorough analysis of (1). The specific choices listed above represent three of the most promising candidates found during our investigation (Figs. 7, 8).

Fig. 7

Median objective function value of all optimization runs in which the final design was closer to the global maximum of (1) than to any other stationary point. The values were obtained using a discretization into \(50\times 50\times 50\) points

Fig. 8

The medians presented in Fig. 7 (solid lines) and the corresponding quantiles \(P_{0.25,0.75}\), indicated by the shaded areas. For better visibility, the number of evaluations is scaled logarithmically and the discretization \(8\times 2\times 2\) was discarded

Fig. 9

Iterates of the different optimization approaches for (1) in the whole design domain \(\mathcal {U}=[1,75]\times [1,250]\). For fmincon, the discretization of \(\Lambda \times \mathcal {R}\times \mathcal {D}\) is given in the titles, respectively. To measure the progress, the starting points are also shown. As mentioned above, an evaluation corresponds to the calculation of \({{\,\textrm{Abs}\,}}\), \({{\,\textrm{Sca}\,}}\), \({{\,\textrm{Geo}\,}}\), \(\nabla {{\,\textrm{Abs}\,}}\), \(\nabla {{\,\textrm{Sca}\,}}\) and \(\nabla {{\,\textrm{Geo}\,}}\) for one combination \((\lambda ,{\tilde{R}},{\tilde{d}})\in \Lambda \times \mathcal {R}\times \mathcal {D}\). Again, the underlying contours are obtained by discretizing \(\Lambda \times \mathcal {R}\times \mathcal {D}\) into \(50\times 50\times 50\) points

Fig. 10

Continuation of the results for (1) presented in Fig. 9. Since CSG was stopped after 5,000 evaluations, the iterates do not change afterwards, but are still shown as a point of reference. In the last row, final designs obtained by \(7\times 7\times 7\) and \(8\times 2\times 2\), which do not correspond to stationary points of (1), are highlighted in blue

As we consider this example to be a prototype for more advanced settings from topology optimization, e.g., switching the setting to the DDA model later, we compare the different approaches with respect to the number of inner gradient evaluations, since this is by far the most time-consuming step in these cases. To be precise, an evaluation represents the calculation of \({{\,\textrm{Abs}\,}}\), \({{\,\textrm{Sca}\,}}\), \({{\,\textrm{Geo}\,}}\), \(\nabla {{\,\textrm{Abs}\,}}\), \(\nabla {{\,\textrm{Sca}\,}}\) and \(\nabla {{\,\textrm{Geo}\,}}\) for a single \((\lambda ,{\tilde{R}},{\tilde{d}})\in \Lambda \times \mathcal {R}\times \mathcal {D}\). These calculations are based on the MATLAB Mie library MatScat [27].

Since the produced iterates depend on the initial design, we randomly selected 500 starting points in the whole design domain \(\mathcal {U}=[1,75]\times [1,250]\). In each optimization run, the total number of evaluations was limited to 50,000 for fmincon and to 5,000 for SCIBL-CSG. To obtain an overview of the general performance of the different approaches, we take snapshots of all iterates after different numbers of evaluations. The results are given in Figs. 9 and 10 and yield a good impression of how fast each method tends to find solutions to (1). Note that, for the sake of readability and better comparison, the final CSG iterates after 5,000 evaluations are shown in all graphs labeled with a higher number of total evaluations.

By comparing Figs. 9 and 10 with Fig. 4, we observe that the artificial flat regions discussed earlier indeed slow down the optimization progress for all choices of prediscretization. Furthermore, we note that only the highest resolution \(10\times 10\times 10\) overcomes this approximation error, at the cost of the largest number of evaluations needed. In contrast, the resolutions \(7\times 7\times 7\) and \(8\times 2\times 2\) converge much faster, but some of the final designs are not stationary points of (1). Out of the 500 optimization runs we performed, \(7\times 7\times 7\) converged to a wrong design, i.e., an artificial local maximum, 16 times (3.2%). For \(8\times 2\times 2\), a wrong design was found in 218 (43.6%) instances, see Fig. 10.

Lastly, we are interested in the performance of each method with respect to \(J(u_n)\) over the course of the iterations. Since each local solution to (1) admits a different objective function value, we focus only on the global maximum. For all approaches, we selected all runs whose final designs are closer to the global maximum of (1) than to any other stationary point. The results are shown in Figs. 7 and 8.

2.8 Optimization in the DDA model

As a final example from this application area, we drop the restriction to core-shell particles and consider hematite nanoparticles of arbitrary shape within the DDA model. While the setting is very similar to the setting analyzed above, there are some minor differences.

First, we slightly change the weights appearing in the objective function:

$$\begin{aligned} \max _{u\in \mathcal {U}}\quad \tfrac{1}{2}\,\textbf{L}(u) + \tfrac{1}{2}\,\textbf{a}(u). \end{aligned}$$
(2)

This change was made purely for aesthetics, as the weights in (1) favour radially symmetric solutions, while (2) admits local solutions with a more interesting design structure. The set \(\mathcal {U}\) will be defined later.

Fig. 11

Representation of the initial designs (top row). Red boxes correspond to cells consisting purely of hematite, while grey boxes indicate an artificial intermediate material, consisting of 50% hematite and 50% water. For later reference, we denote the initial designs by plate (100%), plate (50%) and screwdriver (50%), respectively. The different final designs, obtained by 5,000 iterations of SCIBL-CSG with outer norm (a), are shown in the bottom row. For better visibility, cells with less than \(50\%\) hematite are considered as pure water and left out of the visualization. For each final design, the number of cells discarded in this fashion is less than 100 (less than \(0.15\%\) of all cells)

Furthermore, we do not assume a particle design distribution anymore, since it is unclear what such a general shape distribution should look like. However, as the particles are no longer radially symmetric, we now have to consider the orientation of the particle with respect to the incoming light ray instead. Therefore, the K and S values explained in the introduction of this setting need to be averaged over all possible orientations, i.e.,

$$\begin{aligned} K(u,\lambda )&= \frac{1}{\left| \mathbb {S}^2\right| }\iint _{\mathbb {S}^2}{{\,\textrm{Abs}\,}}(u,\lambda ,\nu )\textrm{d}\nu \end{aligned}$$

and

$$\begin{aligned} S(u,\lambda )&= \frac{1}{\left| \mathbb {S}^2\right| }\iint _{\mathbb {S}^2} {{\,\textrm{Sca}\,}}(u,\lambda ,\nu )\big (1-{{\,\textrm{Geo}\,}}(u,\lambda ,\nu )\big )\textrm{d}\nu . \end{aligned}$$

Here, \(\mathbb {S}^2\) denotes the unit sphere and the particle orientation \(\nu \) is assumed to be distributed uniformly at random over all possible directions.
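To illustrate the orientation averaging, a minimal Monte Carlo sketch is given below: uniformly distributed directions on \(\mathbb {S}^2\) are drawn by normalizing standard Gaussian vectors, and the K value is approximated by the sample mean. The callable `absorption` is an assumption for illustration; in our experiments, the averaging is handled by the CSG integration weights instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_directions(n):
    """Draw n orientations uniformly distributed on the unit sphere S^2."""
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def averaged_K(u, lam, absorption, n_dirs=40):
    """Monte Carlo estimate of K(u, lambda), i.e., Abs(u, lambda, nu) averaged over S^2.

    absorption is an assumed callable returning Abs for a given (u, lambda, nu).
    """
    return float(np.mean([absorption(u, lam, nu) for nu in uniform_directions(n_dirs)]))
```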

Fig. 12

Objective function approximation for the screwdriver (50%) design. The blue and orange curves show the results for CSG with fixed step size \({\tau = 0}\) and different coefficients of the outer norm \(\Vert \cdot \Vert _{_{\text {Out}}}\). For Monte Carlo, each inner integral over \(\mathbb {S}^2\) was approximated using 40 random directions. The true objective function value \({J^*\approx 37.84}\) is indicated by the dashed line. The Monte Carlo results are truncated for the sake of readability, as it requires over 8,000 evaluations to reach a good approximation to \(J^*\)

Fig. 13

CSG objective function approximations during the optimization process for all initial designs and choice (a) for \(\Vert \cdot \Vert _{_{\text {Out}}}\), i.e., \(c_u=1\), \(c_\lambda =100\) and \(c_\nu =100\). The dashed lines indicate the objective function values of each initial design, respectively

The design domain is a ball of 300 nm diameter, discretized into \({n_0=65752}\) dipole cells. The design \(u\in [\varepsilon ,1]^{n_0}=:\mathcal {U}\) gives the relative amount of hematite to water in each cell, with \(\varepsilon =10^{-4}\). The optical properties of intermediate (grey) material \({u^{(i)}\in (0,1)}\) are generated by linear interpolation between the respective properties of water and hematite. Consequently, each admissible design contains a positive amount of hematite, resulting in lower bounds for \({{\,\textrm{Abs}\,}}\) and \({{\,\textrm{Sca}\,}}\). As stated in Sect. 2.4, [2, Assumptions 2.2–2.4] are satisfied, since changing from Mie theory to the DDA model does not interfere with the smoothness of \({{\,\textrm{Abs}\,}}\), \({{\,\textrm{Sca}\,}}\) and \({{\,\textrm{Geo}\,}}\) w.r.t. \((u,\lambda ,\nu )\), see [19, 28].

Generally, one would combine filtering techniques and greyness penalization to obtain a smooth final design without intermediate material (see, e.g., [29]). However, we explicitly refrain from doing so to present a clear analysis of the CSG performance, without interference from secondary layers of smoothing techniques.

As mentioned above, the change to the DDA model significantly increases the computational cost of evaluating \({{\,\textrm{Sca}\,}}\), \({{\,\textrm{Abs}\,}}\) and \({{\,\textrm{Geo}\,}}\) for a given \({(u,\lambda ,\nu )\in \mathcal {U}\times \Lambda \times \mathbb {S}^2}\). Thus, the deterministic approaches used in the previous setting are no longer computationally feasible.

Furthermore, we want to use this example to analyze the impact of the chosen norm on \(\mathcal {U}\times \Lambda \times \mathbb {S}^2\), appearing in the nearest neighbor calculation, which was already mentioned in [2, Section 3.5]. To be precise, calculating the CSG integration weights requires the definition of an outer norm

$$\begin{aligned} \big \Vert (u^*,\lambda ^*,\nu ^*)\big \Vert _{\text {Out}} = c_u\Vert u^*\Vert _{_\mathcal {U}}+ c_\lambda \Vert \lambda ^*\Vert _{_\Lambda } + c_\nu \Vert \nu ^*\Vert _{_{\mathbb {S}^2}}, \end{aligned}$$

where \(\Vert \cdot \Vert _{_\mathcal {U}}\), \(\Vert \cdot \Vert _{_\Lambda }\) and \(\Vert \cdot \Vert _{_{\mathbb {S}^2}}\) denote norms on the corresponding inner spaces and \(c_u,c_\lambda ,c_\nu >0\). In this application, we choose the Euclidean norm \(\Vert \cdot \Vert _{_2}\) for each inner space. Additionally, we fix \(c_u = 1\), but consider different coefficients \(c_\lambda \) and \(c_\nu \).
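A minimal sketch of this outer norm and the associated nearest neighbor search is given below (Python, Euclidean inner norms, default coefficients as in choice (a)); the helper names are hypothetical.

```python
import numpy as np

def outer_norm(du, dlam, dnu, c_u=1.0, c_lam=100.0, c_nu=100.0):
    """Weighted outer norm on U x Lambda x S^2 for difference vectors (du, dlam, dnu).
    Euclidean inner norms; the default coefficients correspond to choice (a)."""
    return c_u * np.linalg.norm(du) + c_lam * abs(dlam) + c_nu * np.linalg.norm(dnu)

def nearest_previous_sample(u_n, lam, nu, history, **coeffs):
    """Index of the stored evaluation point (u_k, lambda_k, nu_k) closest to
    (u_n, lam, nu) w.r.t. the outer norm; history is a list of such triples."""
    dists = [outer_norm(u_n - u_k, lam - lam_k, nu - nu_k, **coeffs)
             for (u_k, lam_k, nu_k) in history]
    return int(np.argmin(dists))
```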

Fig. 14

Top left to bottom right: Design evolution during the optimization process for the screwdriver (50%) initial design and outer norm (a). The design snapshots were taken every 200 iterations. Red boxes represent design cells consisting of pure hematite. Intermediate material is indicated via a color gradient, where a cell filled with \(50\%\) water and \(50\%\) hematite is colored grey. Based on this gradient, depending on the ratio of hematite and water in a cell, the cell color is shifted to red (more hematite) or blue (more water)

Fig. 15

Euclidean distance (after dividing by \(\sqrt{\dim (\mathcal {U})}\) for scaling) between intermediate designs and the respective final design during the SCIBL-CSG optimization process, carried out with outer norm (a)

For the optimization, we consider three different initial designs, which are shown in Fig. 11, top row. The objective function value as well as the values of \(\textbf{L}\), \(\textbf{a}\) and \(\textbf{b}\) for these designs were computed using the CSG method with fixed design, i.e., with constant step size \(\tau =0\), and verified by Monte Carlo integration (see, e.g., [30]). For one of the initial designs, the objective function value approximation of CSG and Monte Carlo integration with respect to the number of evaluations and different choices of \(\Vert \cdot \Vert _{_{\text {Out}}}\) is shown in Fig. 12.

Each design was optimized with SCIBL-CSG, using inexact hybrid weights for the integration over \(\mathbb {S}^2\) and exact hybrid weights for the integration over \(\Lambda \). For \(\Vert \cdot \Vert _{_{\text {Out}}}\), we considered four different choices of the parameters:

  (a) \(c_u = 1\), \(c_\lambda =100\) and \(c_\nu = 100\)

  (b) \(c_u = 1\), \(c_\lambda =1\) and \(c_\nu = 1\)

  (c) \(c_u = 1\), \(c_\lambda =\tfrac{1}{100}\) and \(c_\nu = 1\)

  (d) \(c_u = 1\), \(c_\lambda =\tfrac{1}{100}\) and \(c_\nu =\tfrac{1}{100}\)

The results in case (a) for all three initial designs are presented in Fig. 13 and the respective design evolution for the initial design screwdriver (50%), shown in Fig. 11 top row, is depicted in Fig. 14. The corresponding final designs, obtained after 5,000 SCIBL-CSG iterations, are presented in Fig. 11, bottom row. As a second measure for convergence in the design space, the evolution of the norm distance to the respective final designs is shown in Fig. 15 for all three initial designs.

Fig. 16

CSG objective function value approximation during the optimization process for the plate (100%) initial design. The dashed line shows the initial objective function value, whereas the different graphs correspond to the choices (a), (b) and (c) for \(\Vert \cdot \Vert _{_\text {Out}}\)

Fig. 17

Results for the plate (100%) initial design presented in Fig. 16, augmented by the CSG objective function value approximation in the case that \(\Vert \cdot \Vert _{_\text {Out}}\) was chosen according to (d)

Comparing Figs. 12 and 13, we notice that CSG, using an appropriate outer norm, finds an optimized design almost as fast as it computes the objective function value for a given design. In other words: The full optimization process is only slightly more expensive than the simple evaluation of a single design. Moreover, CSG finds an optimal solution to (2) long before the Monte Carlo approximation to the initial objective function value has converged.

It should, of course, also be noted that \(\Vert \cdot \Vert _{_\text {Out}}\) has to be chosen with caution, as Fig. 16 shows. While case (a) is, to the best of our knowledge, by no means optimal, cases (b) and (c) clearly show worse results. Choosing \(\Vert \cdot \Vert _{_\text {Out}}\) extremely poorly, i.e., case (d), can even have devastating effects on the performance, see Fig. 17.

This, however, also suggests that the performance might be improved significantly if problem-specific inner and outer norms were chosen. Especially in even more complex settings, techniques to obtain such norms a priori, or even during the optimization process itself, represent one of the most important points for further research.

3 Online error estimation

Before we go into theoretical details, we first collect a few key properties and results concerning CSG, which were shown in [2]. In a first simple setting, we consider optimization problems of the form

$$\begin{aligned} \begin{aligned} \min \quad&J(u) \\ \text {s.t.}\quad&u\in \mathcal {U}\subset \mathbb {R}^{d_{\text {o}}}\text { for some }{d_{\text {o}}}\in \mathbb {N}. \end{aligned} \end{aligned}$$
(3)

Additionally, we assume that \(\mathcal {U}\) is compact, and for some \({d_{\text {r}}}\in \mathbb {N}\), there exists an open and bounded set \(\mathcal {X}\subset \mathbb {R}^{d_{\text {r}}}\) and a measure \(\mu \) with \({{\,\textrm{supp}\,}}(\mu )\subset \mathcal {X}\), such that J can be written as \(J(u) = \int _\mathcal {X}j(u,x)\mu (\textrm{d}x)\). The detailed set of assumptions is given in [2, Section 2]. For now, it is only important that \({\nabla _1 j:\mathcal {U}\times \mathcal {X}\rightarrow \mathbb {R}^{d_{\text {o}}}}\) is bounded and Lipschitz continuous, i.e., there exist \(C,L_j>0\) with

$$\begin{aligned} \Vert \nabla _1 j(u,x)\Vert&\le C, \\ \Vert \nabla _1 j(u_1,x_1) - \nabla _1 j(u_2,x_2)\Vert&\le L_j\big (\Vert u_1-u_2\Vert _{_\mathcal {U}}+ \Vert x_1-x_2\Vert _{_\mathcal {X}}\big ) \end{aligned}$$

for all \((u,x),(u_1,x_1),(u_2,x_2)\in \mathcal {U}\times \mathcal {X}\). Due to the finite dimension of all appearing spaces, we can choose arbitrary norms on \(\mathcal {U}\), \(\mathcal {X}\) and \(\mathbb {R}^{d_{\text {o}}}\), and simply denote them by \(\Vert \cdot \Vert _{_\mathcal {U}}\), \(\Vert \cdot \Vert _{_\mathcal {X}}\) and \(\Vert \cdot \Vert \), respectively, unless specific choices are made in numerical experiments.

During the optimization process, CSG computes design-dependent integration weights \(\big (\alpha _k\big )_{k=1,\ldots ,n}\) (cf. [2, Section 3]) to build an approximation \({\hat{G}}_n\) to the true objective function gradient, based on the available samples from previous iterations \(\big (\nabla _1 j(u_k,x_k)\big )_{k=1,\ldots ,n}\). To be precise, we have

$$\begin{aligned} \nabla J(u) = \int _\mathcal {X}\nabla _1 j(u,x) \mu (\textrm{d}x) \approx \sum _{k=1}^n \alpha _k \nabla _1 j(u_k,x_k) =: {\hat{G}}_n. \end{aligned}$$

It was shown in [2, Lemma 4.6] that

$$\begin{aligned} \Vert \nabla J(u_n)-{\hat{G}}_n\Vert \rightarrow 0 \quad \text {for }n\rightarrow \infty \text { almost surely}. \end{aligned}$$

Carefully investigating the methods to obtain the integration weights, we observe that

$$\begin{aligned} \left\| \nabla J(u_n)-{\hat{G}}_n\right\|&= \left\| \int _\mathcal {X}\nabla _1 j(u_n,x)\mu (\textrm{d}x) - {\hat{G}}_n\right\| \\&= \left\| \sum _{i=1}^n \int _{M_i} \nabla _1 j(u_n,x)\mu (\textrm{d}x) - \sum _{i=1}^n \nabla _1 j(u_i,x_i)\nu _n(M_i)\right\| , \end{aligned}$$

where \(\nu _n\) denotes the measure associated to one of the measures listed in [2, Section 3.6], depending on the choice of integration weights, and

$$\begin{aligned} M_k := \big \{ x\in \mathcal {X}\, : \, \Vert u_n&- u_k \Vert _{_\mathcal {U}}+ \Vert x - x_k\Vert _{_\mathcal {X}}\\ {}&< \Vert u_n - u_j \Vert _{_\mathcal {U}}+ \Vert x - x_j\Vert _{_\mathcal {X}}\text { for all } j\in \{1,\ldots ,n\}\setminus \{k\}\big \}. \end{aligned}$$

By construction, \(M_k\) contains all points \(x\in \mathcal {X}\), such that \((u_n,x)\) is closer to \((u_k,x_k)\) than to any other previous point we evaluated \(\nabla _1 j\) at. For exact integration weights, we have \(\nu _n=\mu \) and thus

$$\begin{aligned} \left\| \nabla J(u_n)-{\hat{G}}_n\right\|&= \left\| \sum _{i=1}^n \int _{M_i} \nabla _1 j(u_n,x)\mu (\textrm{d}x) - \sum _{i=1}^n \int _{M_i} \nabla _1 j(u_i,x_i)\mu (\textrm{d}x)\right\| \\&\le \sum _{i=1}^n \int _{M_i} \left\| \nabla _1 j(u_n,x)-\nabla _1 j(u_i,x_i)\right\| \mu (\textrm{d}x)\\&\le \sum _{i=1}^n \int _{M_i} L_j \cdot \left( \sup _{x\in M_i} Z_n(x) \right) \mu (\textrm{d}x) \\&= L_j\sum _{i=1}^n \mu (M_i)\sup _{x\in M_i} Z_n(x)\\&\le L_j\sup _{x\in \mathcal {X}} Z_n(x). \end{aligned}$$

Here, \(Z_n\) is given by

$$\begin{aligned} Z_n(x):= \min _{k\in \{1,\ldots ,n\}}\big (\Vert u_n-u_k\Vert _{_\mathcal {U}}+ \Vert x-x_k\Vert _{_\mathcal {X}}\big ). \end{aligned}$$

In other words, the approximation error can be bounded in terms of the Lipschitz constant of \(\nabla _1 j\) and the quantity \(Z_n\), which relates to the size of Voronoi cells [31] with positive integration weights.

Both \(L_j\) and \(\sup _{x\in \mathcal {X}} Z_n(x)\) can be efficiently approximated during the optimization process, e.g., by finite differences of the samples \(\big (\nabla _1 j(u_i,x_i)\big )_{i=1,\ldots ,n}\) and by

$$\begin{aligned} \sup _{x\in \mathcal {X}} Z_n(x) \approx \max _{k=1,\ldots ,n} Z_n(x_k), \end{aligned}$$

yielding an online error estimation. Such an approximation may, for example, be used in stopping criteria.
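A minimal sketch of this online error estimator is given below (Python, Euclidean inner norms assumed): the stored iterates and samples are kept as plain arrays and the supremum is replaced by the maximum over the stored sample points, as described above. The names are chosen for illustration only.

```python
import numpy as np

def Z_values(u_n, designs, samples):
    """Z_n(x_j) = min_k ( ||u_n - u_k|| + ||x_j - x_k|| ) for all stored samples x_j.

    designs holds the previous iterates u_1, ..., u_n (one per row) and
    samples the corresponding x_1, ..., x_n.
    """
    du = np.linalg.norm(designs - u_n, axis=1)                               # ||u_n - u_k||
    dx = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2)   # ||x_j - x_k||
    return np.min(du[None, :] + dx, axis=1)                                  # minimum over k

def error_estimate(u_n, designs, samples, L_j):
    """Online bound ||grad J(u_n) - G_hat_n|| <= L_j * sup_x Z_n(x), with the
    supremum approximated by the maximum over the stored sample points."""
    return L_j * np.max(Z_values(u_n, designs, samples))
```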

4 Convergence rates

Throughout this section, we assume [2, Assumptions 2.1–2.4] to be satisfied. Moreover, for the entire section, let \((u_n)_{n\in \mathbb {N}}\) correspond to the CSG iterates produced for a fixed random sequence \((x_n)_{n\in \mathbb {N}}\). Then, with probability 1, we have

$$\begin{aligned} \big \Vert {\hat{G}}_n-\nabla J(u_n)\big \Vert \rightarrow 0, \end{aligned}$$

see [2, Lemma 4.6].

4.1 Theoretical background

In the convergence analysis presented in [2], we have already seen that the fashion in which the gradient approximation \({\hat{G}}_n\) is calculated in CSG is crucial for \(\Vert {\hat{G}}_n-\nabla J(u_n)\Vert \rightarrow 0\). This property, in turn, is the key to all advantages CSG offers in comparison to classic stochastic optimization methods, like convergence for constant step sizes, backtracking line search, and the ability to handle more involved optimization problems.

The price we pay for this feature lies within the dependency of \({\hat{G}}_n\) on the past iterates. For comparison, the search direction \({\hat{G}}_n^{\text {SG}}\) in a stochastic gradient descent method is given by

$$\begin{aligned} {\hat{G}}_n^{\text {SG}} = \nabla _1 j(u_n,x_n). \end{aligned}$$

Thus, it is independent of all previous steps and fulfills

$$\begin{aligned} \mathbb {E}_\mathcal {X}\left[ {\hat{G}}_n^{\text {SG}}\right] = \mathbb {E}_\mathcal {X}\big [ \nabla _1 j(u_n,\cdot )\big ] = \nabla J(u_n), \end{aligned}$$

i.e., it is an unbiased sample of the full gradient. The combination of these properties allows for a straightforward convergence rate analysis, see, e.g., [32].

In contrast, \({\hat{G}}_n\) is in general not an unbiased approximation to \(\nabla J(u_n)\) and moreover not independent of \(\big (u_i,x_i\big )_{i=1,\ldots ,n-1}\). The main problem in finding the convergence rate of \(\Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}\rightarrow 0\) is that this quantity depends on the approximation error \(\Vert {\hat{G}}_n-\nabla J(u_n)\Vert \), which, as we have seen in Sect. 3, depends on \(Z_n\). Since \(Z_n\) itself is deeply connected to \(\min _k\Vert u_{n} - u_k\Vert _{_\mathcal {U}}\), we run into a circular argument.

Therefore, up to now, we are not able to prove convergence rates for the CSG iterates. We can, however, state a prediction of this rate and provide numerical evidence.

Conjecture 4.1

We conjecture that the CSG method, applied to problem (3), using a constant step size \(\tau < \tfrac{2}{L}\) and empirical integration weights, fulfills

$$\begin{aligned} \Vert u_{n+1} - u_n\Vert _{_\mathcal {U}}= \mathcal {O}\left( \ln (n)\cdot n^{-\tfrac{1}{\max \{2,{d_{\text {r}}}\}}}\right) \end{aligned}$$

with probability 1.

To motivate this claim, note that, in the proof of [2, Lemma 4.6], it was shown that there exists \(C>0\) such that

$$\begin{aligned} \left\| {\hat{G}}_n-\nabla J(u_n)\right\| \le C \left( \int _\mathcal {X}Z_n(x)\mu (\textrm{d}x) + d_{_W}(\mu _n,\mu )\right) , \end{aligned}$$

where \(d_{_W}\) denotes the Wasserstein distance of the two measures \(\mu _n\) and \(\mu \). By [33, Theorem 1], the empirical measure \(\mu _n\) satisfies

$$\begin{aligned} \mathbb {E}\big [d_{_W}(\mu _n,\mu )\big ] \le C({d_{\text {r}}})\cdot \left( \int _\mathcal {X}\Vert x\Vert _{_\mathcal {X}}^3\mu (\textrm{d}x)\right) ^{\tfrac{1}{3}}\cdot {\left\{ \begin{array}{ll} \tfrac{1}{\sqrt{n}} &{} \text {if }{d_{\text {r}}}= 1, \\ \tfrac{\ln (1+n)}{\sqrt{n}} &{} \text {if } {d_{\text {r}}}= 2, \\ n^{-\tfrac{1}{{d_{\text {r}}}}} &{} \text {if }{d_{\text {r}}}\ge 3.\end{array}\right. } \end{aligned}$$

This result is the main motivation for Conjecture 4.1. It can be shown that the rate \(n^{-1/{d_{\text {r}}}}\) for \({d_{\text {r}}}\ge 3\) is sharp if \(\mu \) corresponds to a uniform distribution on \(\mathcal {X}\). Thus, in this case, it is reasonable to assume that a uniform distribution also corresponds to the worst-case rate of \(\int _\mathcal {X}Z_n(x)\mu (\textrm{d}x)\rightarrow 0\). Assuming that the difference in designs appearing in \(Z_n\) is negligible due to the overall convergence of CSG, we obtain the rate

$$\begin{aligned} \sup _{x\in \mathcal {X}}\; Z_n(x) = \mathcal {O}\left( \ln (n)\cdot n^{-\tfrac{1}{\max \{2,{d_{\text {r}}}\}}}\right) . \end{aligned}$$

To see this, we fill \(\mathcal {X}\subset \mathbb {R}^{{d_{\text {r}}}}\) with balls (w.r.t. the norm \(\Vert \cdot \Vert _{_\mathcal {X}}\)) of radius \({\varepsilon }>0\) and denote by \(N({\varepsilon })\in \mathbb {N}\) the number of cells. Due to the dimension of \(\mathcal {X}\), we have \(N({\varepsilon })=\mathcal {O}\big ({\varepsilon }^{-{d_{\text {r}}}}\big )\). Now, to achieve \(\sup _{x\in \mathcal {X}} Z_n(x) < {\varepsilon }\), we need each of these cells to contain at least one of the sample points \((x_i)_{i=1,\ldots ,n}\). It is well known (coupon collector problem) that the expected number of samples we need to draw for this to happen is given by

$$\begin{aligned} N({\varepsilon })\sum _{k=1}^{N({\varepsilon })}\frac{1}{k} = \mathcal {O}\left( -{\varepsilon }^{-{d_{\text {r}}}}\ln ({\varepsilon })\right) , \end{aligned}$$

where we used

$$\begin{aligned} \sum _{k=1}^n\frac{1}{k} = \mathcal {O}\big (\ln (n)\big ) \quad \text {for }n\rightarrow \infty . \end{aligned}$$

In other words, the convergence rates of \({\int _\mathcal {X}Z_n(x)\mu (\textrm{d}x)\rightarrow 0}\) and \({d_{_W}(\mu _n,\mu )\rightarrow 0}\) are comparable.
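The coupon collector estimate used above can be checked with a small simulation; the covering of \(\mathcal {X}\) is replaced by \(N({\varepsilon })\) abstract cells sampled uniformly at random, which is the worst-case situation considered here.

```python
import numpy as np

rng = np.random.default_rng(1)

def draws_to_cover(n_cells):
    """Number of uniform draws until each of n_cells cells contains a sample."""
    seen, draws = set(), 0
    while len(seen) < n_cells:
        seen.add(int(rng.integers(n_cells)))
        draws += 1
    return draws

N = 1000
empirical = np.mean([draws_to_cover(N) for _ in range(50)])
harmonic = N * np.sum(1.0 / np.arange(1, N + 1))  # N * H_N, of order N * ln(N)
print(empirical, harmonic)  # both roughly 7.5e3 for N = 1000
```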

Now that we have motivated the rates claimed in Conjecture 4.1 for the approximation error \(\Vert {\hat{G}}_n - \nabla J(u_n)\Vert \), we use the following proposition to show that the rates of \(\Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}\rightarrow 0\) cannot be worse.

Proposition 4.2

Assume that the approximation error \(\Vert {\hat{G}}_n-\nabla J(u_n)\Vert \) satisfies

$$\begin{aligned} \Vert {\hat{G}}_n - \nabla J(u_n)\Vert = \mathcal {O}\left( \ln (n)\cdot n^{-\tfrac{1}{\max \{2,{d_{\text {r}}}\}}}\right) . \end{aligned}$$

Then, under the assumptions of Conjecture 4.1, it holds

$$\begin{aligned} \Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}= \mathcal {O}\left( \ln (n)\cdot n^{-\tfrac{1}{\max \{2,{d_{\text {r}}}\}}}\right) . \end{aligned}$$

Proof

Assume for contradiction that this is not the case. Thus, there exists \(N\in \mathbb {N}\) such that

$$\begin{aligned} \left\| \nabla J(u_n)-{\hat{G}}_n\right\| \le \tfrac{1}{2}\left( \tfrac{1}{\tau }-\tfrac{L}{2}\right) \Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}\quad \text {for all }n\ge N. \end{aligned}$$
(4)

By the descent lemma [34, Lemma 5.7], the characteristic property of the projection operator [34, Theorem 6.41] and the Cauchy-Schwarz inequality, we obtain

$$\begin{aligned} J(u_{n+1})&-J(u_n) \\&\le \nabla J(u_n)^\top (u_{n+1}-u_n) + \tfrac{L}{2}\Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}^2 \\&= {\hat{G}}_n^\top (u_{n+1}-u_n) + \tfrac{L}{2}\Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}^2 + \left( \nabla J(u_n)-{\hat{G}}_n\right) ^\top (u_{n+1}-u_n) \\&\le \left( \tfrac{L}{2}-\tfrac{1}{\tau }\right) \Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}^2 + \left\| \nabla J(u_n)-{\hat{G}}_n\right\| \cdot \Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}\\&= \left( \left( \tfrac{L}{2}-\tfrac{1}{\tau }\right) \Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}+ \left\| \nabla J(u_n)-{\hat{G}}_n\right\| \right) \Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}. \end{aligned}$$

Combining this with (4) gives \(J(u_{n+1})\le J(u_n)\) for all \(n\ge N\), since \(\tfrac{L}{2}<\tfrac{1}{\tau }\). Thus, the sequence of objective function values \(\big (J(u_n)\big )_{n\in \mathbb {N}}\) is monotonically decreasing for all \(n\ge N\). By continuity of J and compactness of \(\mathcal {U}\), J is bounded and \(J(u_n)\rightarrow {\bar{J}}\) for some \({\bar{J}}\in \mathbb {R}\). Therefore,

$$\begin{aligned} -\infty < {\bar{J}} - J(u_N) = \sum _{n=N}^\infty \big ( J(u_{n+1})-J(u_n)\big ) \le \tfrac{1}{2}\left( \tfrac{L}{2}-\tfrac{1}{\tau }\right) \sum _{n=N}^\infty \Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}^2. \end{aligned}$$

Hence, the series

$$\begin{aligned} \sum _{n=N}^\infty \Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}^2 \end{aligned}$$

converges, contradicting \(\Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}\ne \mathcal {O}\left( \ln (n)\cdot n^{-\tfrac{1}{\max \{2,{d_{\text {r}}}\}}}\right) \). \(\square \)

4.2 Numerical verification

We want to verify the conjectured rates numerically. For this purpose, we consider two optimization problems that can easily be scaled to high dimensions. The first problem is given by

$$\begin{aligned} \min _{u\in \mathcal {U}}\quad \frac{1}{2}\int _\mathcal {X}\big \Vert u-x\big \Vert _2^2 \textrm{d}x, \end{aligned}$$
(5)

where \(\mathcal {X}= \left[ -\tfrac{1}{2},\tfrac{1}{2}\right] ^{{d_{\text {r}}}}\) and \(\mathcal {U}= [-5,5]^{{d_{\text {r}}}}\), i.e., \(\mathcal {U}\) and \(\mathcal {X}\) have the same dimension. The second problem,

$$\begin{aligned} \min _{u\in \mathcal {U}}\quad \frac{1}{2}\int _{-0.5}^{0.5}\big \Vert u - x\cdot \mathbbm {1}_{{d_{\text {o}}}}\big \Vert _2^2\textrm{d}x, \end{aligned}$$
(6)

fixes \({d_{\text {r}}}= 1\), while \(\mathcal {U}=[-5,5]^{{d_{\text {o}}}}\). Here, \(\mathbbm {1}_{{d_{\text {o}}}}\) represents the vector \((1,1,\ldots ,1)^\top \in \mathbb {R}^{{d_{\text {o}}}}\). Note that, in both settings, we have \(L_j = 1\). Thus, by Sect. 3, we have

$$\begin{aligned} \big \Vert {\hat{G}}_n - \nabla J(u_n)\big \Vert _2 \le \sup _{x\in \mathcal {X}}\; Z_n(x) \approx \max _{k=1,\ldots ,n} Z_n(x_k). \end{aligned}$$

The optimal solution to (5) and (6) is given by the zero vector \(u^*= 0\in \mathcal {U}\).
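To make the experimental setup transparent, the following sketch applies CSG with constant step size and empirical integration weights to (5) in a small dimension. It reflects our reading of the empirical weights in [2, Section 3], namely that each weight is the fraction of stored samples whose nearest previous evaluation point is the corresponding index; it is a minimal illustration, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
d_r = 5           # dimension of integration (equal to the design dimension in (5))
tau = 0.5         # constant step size, as in the experiments
n_iter = 500

u = rng.uniform(-5.0, 5.0, size=d_r)           # random starting design in U = [-5, 5]^d_r
designs, samples, grads = [], [], []

for n in range(1, n_iter + 1):
    x = rng.uniform(-0.5, 0.5, size=d_r)       # draw x_n uniformly from X = [-1/2, 1/2]^d_r
    designs.append(u.copy())
    samples.append(x)
    grads.append(u - x)                        # grad_1 j(u, x) = u - x for problem (5)

    U_arr, X_arr, G_arr = map(np.asarray, (designs, samples, grads))
    # Empirical weights: assign every stored sample x_j to the previous evaluation
    # point (u_k, x_k) closest to (u_n, x_j); alpha_k is the fraction of samples
    # assigned to index k (our reading of the empirical weights in [2, Section 3]).
    du = np.linalg.norm(U_arr - u, axis=1)                              # ||u_n - u_k||
    dx = np.linalg.norm(X_arr[:, None, :] - X_arr[None, :, :], axis=2)  # ||x_j - x_k||
    nearest = np.argmin(du[None, :] + dx, axis=1)
    alpha = np.bincount(nearest, minlength=n) / n

    G_hat = alpha @ G_arr                          # CSG gradient approximation
    u = np.clip(u - tau * G_hat, -5.0, 5.0)        # projected gradient step onto U

print(np.linalg.norm(u))  # distance to the optimal solution u* = 0
```

In the actual experiments, both \(\Vert u_n-u^*\Vert _2\) and \(\max _{k=1,\ldots ,n}Z_n(x_k)\) are recorded in every iteration and the medians over all runs are compared to the conjectured rates.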

Fig. 18

The bold lines represent the median values of \(\max _{k=1,\ldots ,n}Z_n(x_k)\) for the equidistant problem (5) with respect to the iteration counter. The different colors indicate the different dimensions \({d_{\text {r}}}\in \{1,2,\ldots ,500\}\). The dotted lines correspond to the respective predicted rates \(n^{-\tfrac{1}{\max \{2,{d_{\text {r}}}\}}}\). Since the predictions for \({d_{\text {r}}}=1\) and \({d_{\text {r}}}=2\) are equal, only the case \({d_{\text {r}}}=2\) is shown

Fig. 19

Median values of \(\Vert u_n-u^*\Vert \) in the equidimensional setting (5) for different choices of \({d_{\text {r}}}\in \{1,2,\ldots ,500\}\). For each dimension, the predicted worst-case asymptotic line \(n^{-\tfrac{1}{\max \{2,{d_{\text {r}}}\}}}\) is indicated by the dotted line. Again, we omit the prediction for \({d_{\text {r}}}=1\), since it has the same slope as the prediction for \({d_{\text {r}}}=2\)

Fig. 20

Results for the median of \(\max _{k=1,\ldots ,n}Z_n(x_k)\) in setting (6) for different dimensions \({d_{\text {o}}}\in \{1,2,\ldots ,1000\}\), indicated by different colors. As we conjectured, the asymptotic slope of all curves is equal, since \({d_{\text {r}}}=1\) is fixed. As a point of reference, we added the graph of \(n^{-0.65}\), represented by the dotted line

Fig. 21

Median distance to the optimal solution \(u^*\) during the course of the iterations for \({d_{\text {o}}}\in \{1,2,\ldots ,1000\}\). Again, the asymptotic slope of all curves is equal and we added the line corresponding to \(n^{-0.65}\) for comparison

In our analysis, for different values of the dimensions \({d_{\text {r}}},{d_{\text {o}}}\in \mathbb {N}\), problems (5) and (6) were initialized with 500 random starting points. The constant step size of CSG was chosen as \(\tau = \tfrac{1}{2}\). We track \(\Vert u_n - u^*\Vert _2\) and \(\max _{k=1,\ldots ,n} Z_n(x_k)\) during the optimization process and compare the median of the 500 runs to the rates predicted in Conjecture 4.1. The results can be seen in Figs. 18, 19, 20 and 21. Note that, for the plots of the predicted rates, we omitted the factor \(\ln (n)\). Therefore, the corresponding graphs are straight lines, where the slope \(-\tfrac{1}{\max \{2,{d_{\text {r}}}\}}\) is equal to the asymptotic slope of the predicted rate, since

$$\begin{aligned} \ln (n)\cdot n^{-\tfrac{1}{\max \{2,{d_{\text {r}}}\}}} = \mathcal {O}\left( n^{-\tfrac{1}{\max \{2,{d_{\text {r}}}\}}+{\varepsilon }}\right) \quad \text {for all }{\varepsilon }>0. \end{aligned}$$

In the equidimensional, i.e., \(\dim (\mathcal {X})=\dim (\mathcal {U})\), setting (5), the experimentally obtained values for \(Z_n\) almost perfectly match the claimed rates. For \(\Vert u_n-u^*\Vert _2\), the observed rates also match the predictions for very small and large dimensions. For \({d_{\text {r}}}=3,4,5\), the convergence obtained in the experiments was even slightly faster than predicted. Investigating the results for (6), it is clearly visible that increasing the design dimension \({d_{\text {o}}}\), while keeping the parameter dimension \({d_{\text {r}}}\) fixed, has no influence on the obtained rates of convergence, indicating that CSG is able to efficiently handle large-scale optimization problems.

4.3 Circumventing slow convergence

As we have seen so far, the convergence rate of the CSG method worsens with increasing dimension of integration \({d_{\text {r}}}\in \mathbb {N}\). However, it is possible to circumvent this behavior, if the problem admits additional structure. Assume that there exist suitable \(\mathcal {X}_1,\mathcal {X}_2,\mu _1,\mu _2,f_1\) and \(f_2\) such that the objective function appearing in (3) can be rewritten as

$$\begin{aligned} J(u) = \int _\mathcal {X}j(u,x)\mu (\textrm{d}x) = \int _{\mathcal {X}_1} f_1 \left( u,x,\int _{\mathcal {X}_2} f_2(u,y)\mu _2(\textrm{d}y)\right) \mu _1(\textrm{d}x). \end{aligned}$$

Assume further, that \(\mathcal {X}_1,\mathcal {X}_2,\mu _1,\mu _2,f_1\) and \(f_2\) satisfy the corresponding equivalents of [2, Assumptions 2.1–2.4].

Now, we can independently calculate integration weights \((\beta _k)_{k=1,\ldots ,n}\) and \((\alpha _k)_{k=1,\ldots ,n}\) for the integrals over \(\mathcal {X}_1\) and \(\mathcal {X}_2\), respectively. The corresponding CSG approximations (indicated by hats) are then given by

$$\begin{aligned} f_n&:= \int _{\mathcal {X}_2} f_2(u_n,y)\mu _2(\textrm{d}y) \approx \sum _{i=1}^n \alpha _i f_2(u_i,y_i) =: {\hat{f}}_n, \\ g_n&:= \int _{\mathcal {X}_2} \nabla _1 f_2(u_n,y)\mu _2(\textrm{d}y) \approx \sum _{i=1}^n \alpha _i\nabla _1 f_2(u_i,y_i) =: {\hat{g}}_n, \\ \nabla J(u_n)&\approx \sum _{i=1}^n\beta _i\Big ( \nabla _1 f_1 (u_i,x_i,{\hat{f}}_i) + \nabla _3 f_1(u_i,x_i,{\hat{f}}_i)\cdot {\hat{g}}_i\Big )=:{\hat{G}}_n. \end{aligned}$$

The same steps as performed in the proof of [2, Lemma 4.6] yield the existence of a constant \(C_1>0\), depending only on the Lipschitz constants of \(\nabla f_1\) and \(\nabla f_2\), such that

$$\begin{aligned}&\Big \Vert \nabla J(u_n) - {\hat{G}}_n \Big \Vert \nonumber \\&\le C_1\! \Big ( d_{_W}(\mu _1,\nu ^{\beta }_n)+\sup _{x\in \mathcal {X}_1}\min _{k=1,\ldots ,n}\!\!\big ( \Vert u_n - u_k\Vert _{_\mathcal {U}}\!\!\! + \Vert x - x_k\Vert _{_{\mathcal {X}_1}}\!\!\! + \vert {\hat{f}}_n - {\hat{f}}_k\vert \big ) \Big ). \end{aligned}$$
(7)

Here, \(\nu ^\beta _n\) corresponds to the measure related to the integration weights \((\beta _k)_{k=1,\ldots ,n}\), see [2, Assumption 2.4]. Now, denoting by \(C_2>0\) a constant depending on the Lipschitz constant \(L_{f_2}\) of \(f_2\), we decompose the last term:

$$\begin{aligned}&\vert {\hat{f}}_n - {\hat{f}}_k\vert \nonumber \\&\le \vert {\hat{f}}_n - f_n\vert + \vert {\hat{f}}_k - f_k\vert + \vert f_n-f_k\vert \nonumber \\&\le \vert {\hat{f}}_n - f_n\vert + \vert {\hat{f}}_k - f_k\vert + L_{f_2} \Vert u_n-u_k\Vert _{_\mathcal {U}}\nonumber \\&\le C_2\Big (\Vert u_n-u_k\Vert _{_\mathcal {U}}+ \sup _{y\in \mathcal {X}_2}\min _{i=1,\ldots ,n} \big ( \Vert u_n - u_i\Vert _{_\mathcal {U}}+ \Vert y - y_i\Vert _{_{\mathcal {X}_2}}\big ) \nonumber \\&\quad + \sup _{y\in \mathcal {X}_2}\min _{i=1,\ldots ,k} \big ( \Vert u_k - u_i\Vert _{_\mathcal {U}}+ \Vert y - y_i\Vert _{_{\mathcal {X}_2}}\big ) +d_{_W}(\mu _2,\nu ^{\alpha }_n) + d_{_W}(\mu _2,\nu ^{\alpha }_k)\Big ) \nonumber \\&= C_2\Big (\Vert u_n\! -u_k\Vert _{_\mathcal {U}}\! + \!\sup _{y\in \mathcal {X}_2}\! Z_n(y) + \!\sup _{y\in \mathcal {X}_2}\! Z_k(y) +d_{_W}(\mu _2,\nu ^{\alpha }_n) + d_{_W}(\mu _2,\nu ^{\alpha }_k)\Big ). \end{aligned}$$
(8)

Assuming that the convergence of the sequence \((u_n)_{n\in \mathbb {N}}\) generated by the CSG method implies

$$\begin{aligned} \mathcal {O}\left( \sup _{y\in \mathcal {X}_2} Z_n (y)\right) = \mathcal {O}\left( \sup _{y\in \mathcal {X}_2} Z_k (y)\right) \quad \text {and}\quad \mathcal {O}\big ( d_{_W}(\mu _2,\nu ^{\alpha }_n)\big ) = \mathcal {O}\big ( d_{_W}(\mu _2,\nu ^{\alpha }_k)\big ), \end{aligned}$$

we insert (8) into (7), to obtain

$$\begin{aligned} \big \Vert \nabla J(u_n)-{\hat{G}}_n\big \Vert \le C(C_1,C_2)\Big ( d_{_W}(\mu _1,\nu ^{\beta }_n) + d_{_W}(\mu _2,\nu ^{\alpha }_n) + \sup _{x\in \mathcal {X}_1} Z_n(x) + \sup _{y\in \mathcal {X}_2} Z_n(y)\Big ). \end{aligned}$$

Therefore, by the same arguments as in Sect. 4.1, we conjecture

$$\begin{aligned} \big \Vert \nabla J(u_n)-{\hat{G}}_n\big \Vert&= \mathcal {O}\left( \ln (n)\cdot n^{-\tfrac{1}{\max \{2,\dim (\mathcal {X}_1),\dim (\mathcal {X}_2)\}}}\right) , \\ \Vert u_{n+1}-u_n\Vert _{_\mathcal {U}}&= \mathcal {O}\left( \ln (n)\cdot n^{-\tfrac{1}{\max \{2,\dim (\mathcal {X}_1),\dim (\mathcal {X}_2)\}}}\right) . \end{aligned}$$

In conclusion, we conjecture that, assuming the objective function can be rewritten in terms of nested expectation values

$$\begin{aligned} J(u) = \int _{\mathcal {X}_1} f_1\left( u,x_1,\int _{\mathcal {X}_2}f_2\left( u,x_2,\int _{\mathcal {X}_3}f_3(\cdots )\mu _3(\textrm{d}x_3) \right) \mu _2(\textrm{d}x_2)\right) \mu _1(\textrm{d}x_1), \end{aligned}$$

the convergence rate of the CSG method depends only on the largest dimension of the occurring \(\mathcal {X}_i\), which may be much lower than \(\dim (\mathcal {X})\).

Since this is again a claim and not a rigorous proof, we validate this assumption numerically. For this, we once more consider (5) and initialize it with 500 random starting points. This time, however, we utilize the fact that the objective function can be written as

$$\begin{aligned} J(u) = \frac{1}{2}\int _{\mathcal {X}} \Vert u-x\Vert _2^2\textrm{d}x = \frac{1}{2}\int _{\mathcal {X}} \Big ( \sum _{i=1}^{{d_{\text {r}}}}(u_i-x_i)^2\Big ) \textrm{d}x = \frac{1}{2}\sum _{i=1}^{{d_{\text {r}}}} \int _{-\tfrac{1}{2}}^{\tfrac{1}{2}}(u_i-x_i)^2\textrm{d}x_i. \end{aligned}$$

Thus, we can group the independent coordinates into subintegrals of arbitrary dimension, allowing us to study our claim for a large number of different regroupings without having to change the whole problem formulation; a small sketch of this rewriting is given below. The results for several different decompositions and 500 random starting points in the case \({d_{\text {r}}}=100\) are shown in Fig. 22. The improved rates of convergence are clearly visible, independent of whether the subgroup dimensions are equal or not. As claimed above, the highest remaining dimension of integration determines the overall convergence rate of CSG.
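The regrouping itself is a purely algebraic rewriting of the objective function, as the following sketch illustrates for the decomposition into one block of dimension 75 and five blocks of dimension 5; the per-block CSG integration weights are then computed exactly as before, once for every subintegral. The Monte Carlo samples appear only to check the identity numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
d_r = 100
u = rng.uniform(-5.0, 5.0, size=d_r)
xs = rng.uniform(-0.5, 0.5, size=(20000, d_r))        # samples for the numerical check only

# One block of dimension 75 and five blocks of dimension 5 (cf. the orange curve in Fig. 22).
groups = [np.arange(0, 75)] + [np.arange(75 + 5 * i, 80 + 5 * i) for i in range(5)]

def J_full(u, xs):
    """J(u) = 1/2 int_X ||u - x||_2^2 dx over X = [-1/2, 1/2]^d_r, estimated by sampling."""
    return 0.5 * np.mean(np.sum((u - xs) ** 2, axis=1))

def J_grouped(u, xs, groups):
    """The same objective written as a sum of lower-dimensional subintegrals; each group
    can then be equipped with its own CSG integration weights."""
    return sum(0.5 * np.mean(np.sum((u[g] - xs[:, g]) ** 2, axis=1)) for g in groups)

print(J_full(u, xs), J_grouped(u, xs, groups))   # identical up to rounding
```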

Fig. 22

Median total error \(\Vert u_n-u^*\Vert _2\) of the CSG iterates for (5), for \({d_{\text {r}}}=100\). The integral over \(\mathcal {X}=\left[ -\tfrac{1}{2},\tfrac{1}{2}\right] ^{{d_{\text{ r }}}}\) has been decomposed into several integrals of smaller dimension. The labels in the bottom left give details about the decomposition, e.g., the orange line corresponds to splitting the whole integral into one integral of dimension 75 and 5 integrals of dimension 5. The dotted line indicates the expected rate of convergence obtained by the CSG method without splitting up the integral

5 Conclusion and outlook

In this contribution, we presented a numerical analysis of the CSG method. The practical performance of CSG was tested for two applications from nanoparticle design optimization with varying computational complexity. For the low-dimensional problem formulation, CSG was shown to outperform the commercial fmincon blackbox solver. The high-dimensional setting provided an example for which classic optimization schemes (stochastic as well as deterministic) from the literature do not provide optimal solutions within reasonable time.

Convergence rates for CSG with constant step size were proposed and analytically motivated. They were shown to agree with numerically obtained convergence rates in several different instances. Moreover, in the case that the objective function admits additional structure, techniques to circumvent slow convergence for high dimensional integration domains were presented.

While the proposed convergence rates for CSG agree with our experimental results, it remains an open question if they can be proven rigorously. Furthermore, even though the choice of a metric for the nearest neighbor approximation in the integration weights is irrelevant for the convergence results, a problem specific metric could significantly improve the performance of CSG by exploiting additional structure, which might be lost by utilizing an arbitrary metric. How to automatically obtain such a metric during the optimization process requires further research.