
1 Introduction

Evolution Strategies (ES) [12, 13] are well-recognized Evolutionary Algorithms suited for real-valued non-linear optimization. State-of-the-art ES such as the CMA-ES [8] or its simplification [5] are also well-suited for locating global optimizers in highly multimodal fitness landscapes. While the CMA-ES was originally intended mainly for non-differentiable optimization problems and was regarded as a locally acting strategy, it was already observed in [7] that a large population size can turn the ES into a strategy that is able to locate the global optimizer among a huge number of local optima. This is a surprising observation when considering the ES as a strategy that acts mainly locally in the search space, following some kind of gradient or natural gradient [3, 6, 11]. As one can easily check using standard (highly) multimodal test functions such as Rastrigin, Ackley, and Griewank, to name a few, this ES property is not intimately related to the covariance matrix adaptation (CMA) ES, which generates non-isotropic correlated mutations, but can also be found in \((\mu /\mu _I, \lambda )\)-ES with isotropic mutations. Therefore, if one wants to understand the underlying working principles by which the ES locates the global optimizer, the analysis of the \((\mu /\mu _I, \lambda )\)-ES is the natural starting point.

The question why and when optimization algorithms – originally designed for local search – are able to locate global optima has gained attention in the last few years. A recurring idea comes from relaxation procedures that transform the original multimodal optimization problem into a convex optimization problem, called Gaussian continuation [9]. Gaussian continuation is nothing other than a convolution of the original optimization problem with a Gaussian kernel. As has been shown in [10], using the right Gaussian, Rastrigin-like functions can be transformed into convex optimization problems, thus making them accessible to gradient following strategies. However, this raises the question of how to perform the convolution efficiently. One road followed in [14] uses high-order Gauss-Hermite integration in conjunction with a gradient descent strategy, yielding surprisingly good results. The other road coming to mind is approximating the convolution by Gaussian sampling. This resembles the procedure ES follow: starting from a parental state, offspring are generated by Gaussian mutations. The problem is, however, that a huge number of samples, i.e. offspring in ES, must be generated in order to obtain a reliable gradient from the convolution. The number of offspring needed for reliable estimates appears to be much larger than the offspring population sizes used in the ES experiments conducted in [7], which showed an approximately linear relation between problem dimension N and population size for the Rastrigin function. Therefore, understanding the ES performance from the viewpoint of Gaussian relaxation does not seem to help much.

The approach followed in this paper incorporates two main concepts, namely a progress rate analysis and its application within the so-called evolution equations modeling the transition dynamics of the ES [2]. The progress rate measure yields the expected positional change in search space between two generations, depending on location, strategy, and test function parameters. For investigating and understanding the dynamics of globally converging ES runs, the progress rate is therefore an essential quantity to model the expected evolution dynamics over many generations.

This paper provides first results of a scientific program that aims at an analysis of the performance of the \((\mu /\mu _I, \lambda )\)-ES on Rastrigin’s test function based on a first order progress rate. After a short introduction of the \((\mu /\mu _I, \lambda )\)-ES, the N-dimensional first order progress rate will be defined and an approximation will be derived, resulting in a closed form expression. Its predictive power and limitations will be checked by one-generation experiments. The progress rate will then be used to simulate the ES dynamics on Rastrigin using difference equations. This simulation will be compared with real runs of the \((\mu /\mu _I, \lambda )\)-ES. In a concluding section a summary of the results and an outlook on future research will be given.

2 Rastrigin Function and Local Quality Change

The real-valued minimization problem defined for an N-dimensional search vector \(\mathbf {y} = (y_1, ..., y_N)\) is performed on the Rastrigin test function f given by

$$\begin{aligned} f(\mathbf {y}) = \sum _{i=1}^{N} f_i(y_i) = \sum _{i=1}^{N} \left( y_i^2 + A - A\cos (\alpha y_i)\right) , \end{aligned}$$
(1)

with A denoting the oscillation amplitude and \(\alpha =2\pi \) the corresponding frequency. The quadratic term with superimposed oscillations yields a finite number M of local minima per dimension i, such that the overall number of minima scales exponentially as \(M^N\), posing a highly multimodal minimization problem. The global optimizer is at \(\hat{\mathbf {y}}=\mathbf {0}\).
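For concreteness, a minimal NumPy sketch of (1) follows (our own illustration; the names rastrigin, A, and ALPHA are not from the paper):

```python
import numpy as np

A = 10.0           # oscillation amplitude
ALPHA = 2 * np.pi  # oscillation frequency

def rastrigin(y, a=A, alpha=ALPHA):
    """Rastrigin test function, Eq. (1); the global minimum is f(0) = 0."""
    y = np.asarray(y, dtype=float)
    return np.sum(y**2 + a - a * np.cos(alpha * y))

print(rastrigin(np.zeros(100)))       # 0.0 at the global optimizer
print(rastrigin(np.full(100, 1.0)))   # close to a local minimizer in every dimension
```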

For the progress rate analysis in Sect. 4 the local quality function \(Q_{\mathbf {y}}(\mathbf {x})\) at \(\mathbf {y}\) due to a mutation vector \(\mathbf {x} = (x_1, ..., x_N)\) is needed. In order to reuse results from noisy progress rate theory, it will be formulated for the maximization case of \(F(\mathbf {y}) = -f(\mathbf {y})\) with \(F_i(y_i) = -f_i(y_i)\), such that the local quality change reads

$$\begin{aligned} Q_{\mathbf {y}}(\mathbf {x})= F(\mathbf {y} + \mathbf {x}) - F(\mathbf {y}) = f(\mathbf {y}) - f(\mathbf {y} + \mathbf {x}). \end{aligned}$$
(2)

\(Q_{\mathbf {y}}(\mathbf {x})\) can be evaluated for each component i independently giving

$$\begin{aligned}&Q_{\mathbf {y}}(\mathbf {x})= \sum _{i=1}^{N} Q_i(x_i) = \sum _{i=1}^{N} \left( f_i(y_i) - f_i(y_i + x_i)\right) \end{aligned}$$
(3)
$$\begin{aligned}&= \sum _{i=1}^{N} -\left( x_i^2 +2 y_i x_i + A\cos {\left( \alpha y_i \right) }(1 - \cos {\left( \alpha x_i \right) }) + A\sin {\left( \alpha y_i \right) }\sin {\left( \alpha x_i \right) }\right) . \end{aligned}$$
(4)

A closed form solution of the progress rate appears to be obtainable only for a linearized expression of \(Q_i(x_i)\). A first approach taken in this paper is based on a Taylor expansion with respect to the mutation \(x_i\), discarding higher order terms

$$\begin{aligned} Q_i(x_i) = F_i(y_i+x_i) - F_i(y_i)&= \frac{\partial F_i}{\partial y_i}x_i + O(x_i^2) \end{aligned}$$
(5)
$$\begin{aligned}&\approx \left( -2y_i - \alpha A \sin {\left( \alpha y_i \right) }\right) x_i =: -f'_i x_i, \end{aligned}$$
(6)

using the following derivative terms

$$\begin{aligned} k_i = 2y_i \quad \text { and }\quad d_i = \alpha A \sin (\alpha y_i), \quad \text { such that }\quad \frac{\partial f_i}{\partial y_i}=f'_i=k_i+d_i. \end{aligned}$$
(7)

A second approach is to consider only the linear term of Eq. (4) and neglect all non-linear terms denoted by \(\delta (x_i)\) according to

$$\begin{aligned} Q_i(x_i)&= -2 y_i x_i - x_i^2 - A\cos {\left( \alpha y_i \right) }(1 - \cos {\left( \alpha x_i \right) }) - A\sin {\left( \alpha y_i \right) }\sin {\left( \alpha x_i \right) } \end{aligned}$$
(8)
$$\begin{aligned}&= -2 y_i x_i + \delta (x_i) \approx -2 y_i x_i = -k_i x_i. \end{aligned}$$
(9)

The linearization using \(f'_i\) is a local approximation of the function incorporating oscillation parameters A and \(\alpha \). Using only \(k_i\) (setting \(d_i=0\)) discards oscillations by approximating the quadratic term via \(k_i = \partial (y_i^2)/\partial y_i = 2y_i\) with negative sign due to maximization. Both approximations will be evaluated later.
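To make the difference between the two linearizations tangible, the following sketch (our own illustration, using the same A and \(\alpha \) as above) evaluates the exact quality change (4) against the approximations (6) and (9) for a few mutation sizes:

```python
import numpy as np

A, ALPHA = 10.0, 2 * np.pi

def Q_exact(y_i, x_i):
    """Exact per-component quality change, Eq. (4) (maximization of F = -f)."""
    return -(x_i**2 + 2*y_i*x_i
             + A*np.cos(ALPHA*y_i)*(1 - np.cos(ALPHA*x_i))
             + A*np.sin(ALPHA*y_i)*np.sin(ALPHA*x_i))

def Q_lin_full(y_i, x_i):   # Eq. (6): Q_i ~ -f'_i x_i with f'_i = k_i + d_i, Eq. (7)
    return -(2*y_i + ALPHA*A*np.sin(ALPHA*y_i)) * x_i

def Q_lin_quad(y_i, x_i):   # Eq. (9): Q_i ~ -k_i x_i, oscillation discarded
    return -2*y_i * x_i

y_i = 1.19
for x_i in (0.01, 0.1, 1.0):
    print(x_i, Q_exact(y_i, x_i), Q_lin_full(y_i, x_i), Q_lin_quad(y_i, x_i))
```

For small \(x_i\) the full linearization tracks the exact quality change closely, while for larger \(x_i\) the neglected non-linear terms become visible; this is exactly the behavior analyzed in Sect. 4.3.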

3 The \((\mu /\mu _I, \lambda )\text {-ES}\) with Normalized Mutations

The Evolution Strategy under investigation consists of a population of \(\mu \) parents and \(\lambda \) offspring (\(\mu < \lambda \)) per generation g. Algorithm 1 is presented below and offspring variables are denoted with overset “\(\sim \)”.

Population variation is achieved by applying an isotropic normally distributed mutation \(\mathbf {x} \sim \sigma \mathcal {N}(0, \mathbf {1})\) with strength \(\sigma \) to the parent recombinant in Lines 6 and 7. The recombinant is obtained using intermediate recombination of all \(\mu \) parents equally weighted in Line 11. Selection of the \(m=1,...,\mu \) best search vectors \(\mathbf {y}_{m;\lambda }\) (out of \(\lambda \)) according to their fitness is performed in Line 10.

Note that the ES in Algorithm 1 operates under constant normalized mutation \(\sigma ^*\) in Lines 3 and 12 using the spherical normalization

$$\begin{aligned} \sigma ^* = \frac{\sigma ^{(g)} N}{\big \Vert {\mathbf {y}^{(g)}}\big \Vert } = \frac{\sigma ^{(g)}N}{R^{(g)}} . \end{aligned}$$
(10)

This property ensures global convergence of the algorithm as the mutation strength \(\sigma ^{(g)}\) decreases if and only if the residual distance \(\big \Vert {\mathbf {y}^{(g)}}\big \Vert = R^{(g)}\) decreases. While \(\sigma ^*\) is not known during black-box optimizations, it is used here to investigate the dynamical behavior of the ES using the first order progress rate approach to be developed in this paper. Incorporating self-adaptation of \(\sigma \) or cumulative step-size adaptation remains for future research.

[Algorithm 1: the \((\mu /\mu _I, \lambda )\)-ES with constant normalized mutation \(\sigma ^*\)]
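Since the pseudocode figure is not reproduced here, the following Python sketch reconstructs the procedure from the description above (constant \(\sigma ^*\), isotropic mutations, truncation selection, intermediate recombination). It is our own interpretation of Algorithm 1, not the original listing; the function name mu_mu_lambda_es is ours, and the line references in the comments follow the text.

```python
import numpy as np

def mu_mu_lambda_es(f, y0, mu, lam, sigma_star, generations, rng=None):
    """Sketch of the (mu/mu_I, lambda)-ES with constant normalized mutation sigma*."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y0, dtype=float)
    N = y.size
    for g in range(generations):
        sigma = sigma_star * np.linalg.norm(y) / N       # normalization (10), cf. Lines 3 and 12
        x = sigma * rng.standard_normal((lam, N))        # isotropic mutations, cf. Lines 6 and 7
        offspring = y + x
        fitness = np.array([f(o) for o in offspring])    # evaluate f (minimization)
        best = np.argsort(fitness)[:mu]                  # truncation selection, cf. Line 10
        y = offspring[best].mean(axis=0)                 # intermediate recombination, cf. Line 11
    return y
```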

4 Progress Rate

4.1 Definition

Having introduced the Evolution Strategy, we are interested in the expected one-generation progress of the optimization on the Rastrigin function (1) before investigating the dynamics over multiple generations.

A first order progress rate \(\varphi _i\) for the i-th component between two generations \(g \rightarrow g+1\) can be defined as the expectation value over the positional difference of the parental components

$$\begin{aligned} \varphi _i = {\text {E}}\left[ \,y_i^{(g)} - y_i^{(g+1)}\,\big |\,\sigma ^{(g)}, \mathbf {y}^{(g)}\,\right] = y_i^{(g)} - {\text {E}}\left[ \,y_i^{(g+1)}\,\big |\,\sigma ^{(g)}, \mathbf {y}^{(g)}\,\right] , \end{aligned}$$
(11)

given mutation strength \(\sigma ^{(g)}\) and the position \(\mathbf {y}^{(g)}\). First, an expression for \(\mathbf {y}^{(g+1)}\) is needed, see Algorithm 1, Line 11. It is the result of mutation, selection and recombination of the \(m=1, ..., \mu \) offspring vectors yielding the highest fitness, such that \(\mathbf {y}^{(g+1)} = \frac{1}{\mu }\sum _{m=1}^{\mu }\tilde{\mathbf {y}}_{m;\lambda } = \frac{1}{\mu }\sum _{m=1}^{\mu }(\mathbf {y}^{(g)} + \mathbf {x})_{m;\lambda }\). Considering the i-th component, noting that \(\mathbf {y}^{(g)}\) is the same for all offspring and setting \(\left( \mathbf {x}_{m;\lambda }\right) _i = x_{m;\lambda }\) one has

$$\begin{aligned} y_i^{(g+1)} = \frac{1}{\mu }\sum _{m=1}^{\mu }(y_i^{(g)} + x_{m;\lambda }) = y_i^{(g)} + \frac{1}{\mu }\sum _{m=1}^{\mu }x_{m;\lambda }. \end{aligned}$$
(12)

Taking the expectation \({\text {E}}\left[ y_i^{(g+1)}\right] \), setting \(x = \sigma z\) with \(z \sim \mathcal {N}(0,1)\), and inserting the expression back into (11) yields

$$\begin{aligned} \varphi _i = -\frac{1}{\mu }{\text {E}}\left[ \sum _{m=1}^{\mu }x_{m;\lambda } \bigg | \sigma ^{(g)}, \mathbf {y}^{(g)} \right] = -\frac{\sigma }{\mu }{\text {E}}\left[ \sum _{m=1}^{\mu }z_{m;\lambda } \bigg | \sigma ^{(g)}, \mathbf {y}^{(g)}\right] . \end{aligned}$$
(13)

Therefore progress can be evaluated by averaging over the expectations of \(\mu \) selected mutation contributions. In principle this task can be solved by deriving the induced order statistic density \(p_{m;\lambda }\) for the m-th best individual and subsequently solving the integration over the i-th component

$$\begin{aligned} \varphi _i = -\frac{1}{\mu } \sum _{m=1}^{\mu }\int _{-\infty }^{\infty } x_i\ p_{m;\lambda }(x_i|\sigma ^{(g)}, \mathbf {y}^{(g)}) \mathrm{d}{x_i}. \end{aligned}$$
(14)

However, the task of computing expectations of sums of order statistics under noise disturbance has already been discussed and solved by Arnold in [1]. Therefore the problem of Eq. (13) will be reformulated in order to apply the solutions provided by Arnold.
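As a sanity check independent of the analysis below, the expectation in (13) can also be estimated by straightforward Monte Carlo simulation of a single generation. A minimal sketch (our own, analogous to the one-generation experiments of Sect. 4.3; the name phi_i_simulated is hypothetical) reads:

```python
import numpy as np

def phi_i_simulated(f, y, sigma, mu, lam, i, trials=10_000, rng=None):
    """Monte Carlo estimate of the first order progress rate (11) for component i."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    progress = np.empty(trials)
    for t in range(trials):
        x = sigma * rng.standard_normal((lam, y.size))   # lambda isotropic mutations
        fitness = np.array([f(y + xm) for xm in x])      # offspring fitness (minimization)
        best = np.argsort(fitness)[:mu]                  # indices of the mu best offspring
        progress[t] = -x[best, i].mean()                 # y_i - y_i^(g+1), cf. Eqs. (11)-(13)
    return progress.mean()
```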

4.2 Expectations of Sums of Noisy Order Statistics

Let z be a random variate with density \(p_z(z)\) and zero mean. The density is expanded into a Gram-Charlier series by means of its cumulants \(\kappa _i\) (\(i\ge 1\)) according to [1, p. 138, D.15]

$$\begin{aligned} p_z(z) = \frac{1}{\sqrt{2\pi \kappa _2}}\mathrm {e}^{-\frac{z^2}{2\kappa _2}}\left( 1 + \frac{\gamma _1}{6}{\text {He}}_{3}\left( \frac{z}{\sqrt{\kappa _2}}\right) + \frac{\gamma _2}{24}{\text {He}}_{4 }\left( \frac{z}{\sqrt{\kappa _2}}\right) + ...\right) , \end{aligned}$$
(15)

with expectation \(\kappa _1=0\), variance \(\kappa _2\), skewness \(\gamma _1=\kappa _3/\kappa _2^{3/2}\), excess \(\gamma _2=\kappa _4/\kappa _2^2\) (higher order terms not shown) and \(\mathrm {He}_{k}\) denoting the k-th order probabilist’s Hermite polynomials. For the problem at hand, see Eq. (13), the mutation variate \(z \sim \mathcal {N}(0, 1)\) with \(\kappa _2=1\) and \(\kappa _i=0\) for \(i\ne 2\) yielding a standard normal density.

Furthermore, let \(\epsilon \sim \mathcal {N}(0, \sigma _\epsilon ^2)\) model additive noise disturbance, such that resulting observed values are \(v = z + \epsilon \). Selection of the m-th largest out of \(\lambda \) values yields

$$\begin{aligned} v_{m;\lambda } = (z + \mathcal {N}(0, \sigma _\epsilon ^2))_{m;\lambda }, \end{aligned}$$
(16)

and the distribution of selected source terms \(z_{m;\lambda }\) follows a noisy order statistic with density \(p_{m;\lambda }\). Given this definition and a linear relation between \(z_{m;\lambda }\) and \(v_{m;\lambda }\) the method of Arnold is applicable.

In our case the i-th mutation component \(x_{m;\lambda }\) of Eq. (13) is related to selection via the quality change defined in Eq. (3). Maximizing the fitness \(F_i(y_i + x_i)\) conforms to maximizing quality \(Q_i(x_i)\) with \(F_i(y_i)\) being a constant offset.

Aiming at an expression of form (16) and starting with (3), we first isolate component \(Q_i\) from the remaining \(N-1\) components denoted by \(\sum _{j \ne i} Q_j\). Then, approximations are applied to both terms yielding

$$\begin{aligned} Q_{\mathbf {y}}(\mathbf {x})&= Q_i(x_i) + \sum _{j \ne i} Q_j(x_j) \end{aligned}$$
(17)
$$\begin{aligned}&\approx -f'_i x_i + \mathcal {N}(E_i, D_i^2) , \end{aligned}$$
(18)

with linearization (6) applied to \(Q_i(x_i)\). Additionally, \(\sum _{j \ne i} Q_j \simeq \mathcal {N}(E_i, D_i^2)\), as the sum of independent random variables asymptotically approaches a normal distribution in the limit \(N \rightarrow \infty \) due to the Central Limit Theorem. This is ensured by Lyapunov’s condition provided that there are no dominating components within the sum due to largely different values of \(y_j\). The corresponding Rastrigin quality variance \(D_i^2 = \mathrm {Var}\big [\sum _{j \ne i} Q_j(x_j)\big ]\) is calculated in the supplementary material (https://github.com/omam-evo/paper/blob/main/ppsn22/PPSN22_OB22.pdf). As the expectation \(E_i = \mathrm {E}\big [\sum _{j \ne i}Q_j(x_j)\big ]\) is only an offset to \(Q_{\mathbf {y}}(\mathbf {x})\) it has no influence on the selection and its calculation can be dropped.
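The closed form of \(D_i^2\) is given in the supplementary material; for verification it can also be estimated by sampling the exact \(Q_j\) of Eq. (4), e.g. with the following sketch (our own; the helper name D2_i_sampled is hypothetical):

```python
import numpy as np

A, ALPHA = 10.0, 2 * np.pi

def D2_i_sampled(y, sigma, i, samples=100_000, rng=None):
    """Monte Carlo estimate of D_i^2 = Var[sum_{j != i} Q_j(x_j)], with Q_j from Eq. (4)."""
    rng = np.random.default_rng() if rng is None else rng
    yj = np.delete(np.asarray(y, dtype=float), i)        # remaining N-1 components
    xj = sigma * rng.standard_normal((samples, yj.size))
    Qj = -(xj**2 + 2*yj*xj
           + A*np.cos(ALPHA*yj)*(1 - np.cos(ALPHA*xj))
           + A*np.sin(ALPHA*yj)*np.sin(ALPHA*xj))
    return Qj.sum(axis=1).var()
```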

Using \(x_i = \sigma z_i\) and \(f'_i = {\text {sgn}}\left( f'_i\right) |{f'_i}|\), expression (18) is reformulated as

$$\begin{aligned}&Q_{\mathbf {y}}(\mathbf {x})= -{\text {sgn}}\left( f'_i\right) |{f'_i}|\sigma z_i + E_i + \mathcal {N}(0, D_i^2) \quad \end{aligned}$$
(19)
$$\begin{aligned}&\frac{Q_{\mathbf {y}}(\mathbf {x})-E_i}{|{f'_i}|\sigma } = {\text {sgn}}\left( -f'_i\right) z_i + \mathcal {N}\left( 0, \frac{D_i^2}{(f'_i\sigma )^2}\right) . \end{aligned}$$
(20)

The decomposition using sign function and absolute value is needed for correct ordering of selected values w.r.t. \(z_i\) in (20).

Given result (20), one can define the linearly transformed quality measure \(v_i {:}{=}(Q_{\mathbf {y}}(\mathbf {x})-E_i)/|{f'_i}|\sigma \) and noise variance \(\sigma _\epsilon ^2 {:}{=}(D_i/f'_i\sigma )^2\), such that the selection of mutation component \({\text {sgn}}\left( -f'_i\right) z_i\) is disturbed by a noise term due to the remaining \(N-1\) components. A relation of the form (16) is obtained up to the sign function.

In [1] Arnold calculated the expected value of arbitrary sums \(S_P\) of products of noisy ordered variates containing \(\nu \) factors per summand

$$\begin{aligned} S_P = \sum _{\{n_1, ..., n_\nu \}} z_{n_1;\lambda }^{p_1} \cdots z_{n_\nu ;\lambda }^{p_\nu }, \end{aligned}$$
(21)

with the random variate z introduced in Eqs. (15) and (16). The vector \(P = (p_1, ..., p_\nu )\) denotes the positive exponents and distinct summation indices are denoted by the set \(\{n_1, ..., n_\nu \}\). The generic result for the expectation of (21) is provided in [1, p. 142, D.28] and was adapted to account for the sign difference between (16) and (20), which can result in a reversed ordering. Performing simple substitutions in Arnold’s calculations in [1] and recalling that in our case \(\gamma _1=\gamma _2=0\), the expected value yields

$$\begin{aligned} \begin{aligned} {\text {E}}\left[ S_P\right]&= {\text {sgn}}\left( -f'_i\right) ^{\Vert P\Vert _1} \sqrt{\kappa _2}^{\,\Vert P\Vert _1} \frac{\mu !}{(\mu -\nu )!} \sum _{n=0}^\nu \sum _{k\ge 0} \zeta _{n,0}^{(P)}(k) h_{\mu ,\lambda }^{\nu -n,k}. \end{aligned}\end{aligned}$$
(22)

Note that expression (22) deviates from Arnold’s formula only in the sign in front of \(\sqrt{\kappa _2}\). The coefficients \(\zeta _{n,0}^{(P)}(k)\) are defined in terms of a noise coefficient a according to

$$\begin{aligned} a&= \sqrt{\frac{\kappa _2}{\kappa _2 + \sigma _\epsilon ^2}} \quad \text {with } \zeta _{n,0}^{(P)}(k) = {\text {Polynomial}}(a), \end{aligned}$$
(23)

for which tabulated results are presented in [1, p. 141]. The coefficients \(h_{\mu ,\lambda }^{i,k}\) are numerically obtainable by solving

$$\begin{aligned} h_{\mu ,\lambda }^{i,k} = \frac{\lambda -\mu }{\sqrt{2\pi }}\left( {\begin{array}{c}\lambda \\ \mu \end{array}}\right) \int _{-\infty }^{\infty } {\text {He}}_{k}\left( x\right) \mathrm {e}^{-{\frac{1}{2}{x^2}}} [\phi (x)]^i [\varPhi (x)]^{\lambda -\mu -1} [1-\varPhi (x)]^{\mu -i} \mathrm{d}{x}. \end{aligned}$$
(24)
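The only coefficient needed below is the case \(i=1\), \(k=0\) (for which \({\text {He}}_0 \equiv 1\)), which will turn out to equal the progress coefficient \(c_{\mu /\mu ,\lambda }\). A numerical evaluation of (24) for this case might look as follows (our own sketch, computed in log-space to avoid overflow of the binomial coefficient; assumes SciPy, and the name c_mu_lambda is ours):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln
from scipy.stats import norm

def c_mu_lambda(mu, lam):
    """h_{mu,lambda}^{1,0} from Eq. (24) with He_0 = 1, i.e. the progress coefficient c_{mu/mu,lambda}."""
    # log of the prefactor (lambda - mu)/sqrt(2*pi) * binom(lambda, mu)
    log_pref = (np.log(lam - mu) - 0.5*np.log(2*np.pi)
                + gammaln(lam + 1) - gammaln(mu + 1) - gammaln(lam - mu + 1))

    def integrand(x):
        log_val = (log_pref - 0.5*x**2 + norm.logpdf(x)
                   + (lam - mu - 1)*norm.logcdf(x) + (mu - 1)*norm.logsf(x))
        return np.exp(log_val)

    value, _ = quad(integrand, -10, 10)
    return value

print(c_mu_lambda(1, 2))      # 1/sqrt(pi), approx. 0.5642
print(c_mu_lambda(150, 300))  # near 2*phi(0) = 0.798, the asymptotic value for mu/lambda = 1/2
```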

Now we are in the position to calculate expectation (13) using (22). Since \(z \sim \mathcal {N}(0,1)\), it holds \(\kappa _2 = 1\). Identifying \(P=(1)\), \(\big \Vert {P}\big \Vert _1=1\) and \(\nu =1\) yields

$$\begin{aligned} \begin{aligned} {\text {E}}\left[ \sum _{m=1}^{\mu }z_{m;\lambda }\right]&= {\text {sgn}}\left( -f'_i\right) \frac{\mu !}{(\mu -1)!} \sum _{n=0}^1 \sum _{k\ge 0} \zeta _{n,0}^{(1)}(k) h_{\mu ,\lambda }^{1-n,k} \\&= {\text {sgn}}\left( -f'_i\right) \mu \zeta _{0,0}^{(1)}(0) h_{\mu ,\lambda }^{1,0} = -{\text {sgn}}\left( f'_i\right) \mu a c_{\mu /\mu ,\lambda }, \end{aligned}\end{aligned}$$
(25)

with \(\zeta _{1,0}^{(1)}(k) = 0\) for any k, and \(\zeta _{0,0}^{(1)}(k) \ne 0\) only for \(k=0\) yielding a. The expression \(h_{\mu ,\lambda }^{1,0}\) is equivalent to the progress coefficient definition \(c_{\mu /\mu ,\lambda }\) [2, p. 216]. Inserting (25) back into (13), using \(a = \sqrt{1/(1+(D_i/f'_i\sigma )^2)} = |f'_i| \sigma /\sqrt{(f'_i\sigma )^2+D_i^2}\) with the requirement \(a>0\), and noting that \(f'_i = {\text {sgn}}\left( f'_i\right) |f'_i|\) one finally obtains for the i-th component first order progress rate

$$\begin{aligned} \begin{aligned} \varphi _i(\sigma , \mathbf {y}) = c_{\mu /\mu ,\lambda } \frac{f'_i(y_i)\sigma ^2}{\sqrt{(f'_i(y_i)\sigma )^2 + D_i^2( \sigma ,(\mathbf {y})_{j\ne i} )} }. \end{aligned}\end{aligned}$$
(26)

The population dependency is given by progress coefficient \(c_{\mu /\mu ,\lambda }\). The fitness dependent parameters are contained in \(f'_i\), see (7), and in \(D_i^2\) calculated in the supplementary material (https://github.com/omam-evo/paper/blob/main/ppsn22/PPSN22_OB22.pdf). For better readability the derivative \(f'_i\) and variance \(D_i^2\) are not inserted into (26). An exemplary evaluation of \(D_i^2\) as a function of the residual distance R using normalization (10) is also shown in the supplementary material.
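Putting the pieces together, a sketch of the prediction (26) could read as follows (our own; it reuses c_mu_lambda and D2_i_sampled from the earlier sketches, whereas the paper works with the closed-form \(D_i^2\) from the supplementary material):

```python
import numpy as np

A, ALPHA = 10.0, 2 * np.pi

def phi_i_predicted(y, sigma, i, mu, lam, use_oscillation=True):
    """First order progress rate prediction, Eq. (26); y is a NumPy array."""
    k_i = 2 * y[i]                                                      # Eq. (7)
    d_i = ALPHA * A * np.sin(ALPHA * y[i]) if use_oscillation else 0.0  # d_i = 0 gives phi_i(k_i)
    f_prime = k_i + d_i
    D2 = D2_i_sampled(y, sigma, i)    # sampled here; the paper uses the closed form
    c = c_mu_lambda(mu, lam)
    return c * f_prime * sigma**2 / np.sqrt((f_prime * sigma)**2 + D2)
```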

4.3 Comparison of Simulation and Approximation

Figure 1 shows an experimentally obtained progress rate compared to the result of (26). Due to the large N, one exemplary \(\varphi _i\)-graph is shown on the left, and the corresponding errors for \(i=1,...,N\) are shown on the right.

The left plot shows the progress rate over the \(\sigma \)-range [0, 1]. This magnitude was chosen in order to study the oscillation, as the frequency is \(\alpha =2\pi \). The initial position was chosen randomly on the sphere surface with \(R=10\).

The red dashed curve uses \(f_i'\) as linearization, while the blue dash-dotted curve assumes \(f_i' = k_i\) (with \(d_i=0\)), see also (7). As \(f_i'\) approximates the quality change only locally, agreement with the measured progress is given only for very small mutation strengths \(\sigma \). For larger \(\sigma \), very large deviations may occur, depending on the local derivative.

The blue curve \(\varphi _i(k_i)\) neglects the oscillation (\(d_i=0\)) and therefore follows the progress of the quadratic function \(f(\mathbf {y}) = \sum _i y_i^2\) for large \(\sigma \) with very good agreement. Due to a linearized form of \(Q_i(x_i)\) in (6) neither approximation can reproduce the oscillation for moderately large \(\sigma \).

To verify the approximation quality, the error between (26) and the simulation is displayed on the right side of Fig. 1 for all \(i=1,...,N\), evaluated for small \(\sigma =0.1\) and large \(\sigma =1\). The deviations are very similar in magnitude for all i, given the randomly chosen \(y_i\). Note that for \(\sigma =1\) the red points show very large errors compared to the blue ones, as expected.

Figure 2 shows the progress rate \(\varphi _i\) over \(\sigma ^*\), for \(i=2\) as in Fig. 1, with \(\mathbf {y}\) chosen randomly on spheres of radius \(R=\{100,10,1,0.1\}\). Using \(\sigma ^*\), the mutation strength \(\sigma \) is normalized by the residual distance R via the spherical normalization (10). Far from the origin, at \(R=\{100,10\}\), the quadratic terms dominate, giving better results for \(\varphi _i(k_i)\). Around \(R=1\) the local minima become more relevant and mixed results are obtained, with \(\varphi _i(f_i')\) better for smaller \(\sigma ^*\) and \(\varphi _i(k_i)\) better for larger \(\sigma ^*\). Within the global attractor, at \(R=0.1\), the local structure dominates and \(\varphi _i(f_i')\) yields better results. These observations will be relevant when analyzing the dynamics in Fig. 3, where both approximations show strengths and weaknesses.

Fig. 1. One-generation experiments with (150/150, 300)-ES, \(N=100\), \(A=10\) are performed and quantity (11) is measured averaging over \(10^5\) runs. Left: \(\varphi _i\) over \(\sigma \) for \(i=2\) at position \(y_2\approx 1.19\), where \(\mathbf {y}\) was chosen randomly such that \(\Vert {\mathbf {y}}\Vert =R=10\). Right: error measure \(\varphi _i-\varphi _{i, \text {sim}}\) between (26) and simulation for \(i=1,...,N\) evaluated at \(\sigma =\{0.1, 1\}\). The colors are set according to the legend. (Color figure online)

Fig. 2. One-generation progress \(\varphi _i\) \((i=2)\) over normalized mutation \(\sigma ^*\) for (150/150, 300)-ES, \(N=100\), \(A=1\) and \(R=\{100, 10, 1, 0.1\}\). Simulations are averaged over \(10^5\) runs. These experiments are preliminary investigations related to the dynamics shown in Fig. 3 with \(\sigma ^*=30\). Given a constant \(\sigma ^*\) the approximation quality varies over different magnitudes of R.

5 Evolution Dynamics

As we are interested in the dynamical behavior of the ES, averaged real optimization runs of Algorithm 1 will be compared to the iterated dynamics obtained from progress rate result (26) by applying the dynamical systems approach [2]. Neglecting fluctuations, i.e., setting \(y_i^{(g+1)} = {\text {E}}\left[ y_i^{(g+1)}\big |{\sigma ^{(g)}, \mathbf {y}^{(g)}}\right] \), the mean value dynamics of the mapping \(y_i^{(g)} \rightarrow y_i^{(g+1)}\) immediately follows from (11), giving

$$\begin{aligned} y_i^{(g+1)} = y_i^{(g)} - \varphi _i(\sigma ^{(g)}, \mathbf {y}^{(g)}). \end{aligned}$$
(27)

The control scheme of \(\sigma ^{(g)}\) was introduced in Eq. (10) and simply yields

$$\begin{aligned} \sigma ^{(g)} = \sigma ^*\Big \Vert {\mathbf {y}^{(g)}}\Big \Vert /N. \end{aligned}$$
(28)

Equations (27) and (28) describe a deterministic iteration in search space and rescaling of mutations according to the residual distance. For a convergence analysis, we are interested in the dynamics of \(R^{(g)}=\big \Vert {\mathbf {y}^{(g)}}\big \Vert \) rather than the actual position values \(\mathbf {y}^{(g)}\). Hence in Fig. 3 the \(R^{(g)}\)-dynamics of the conducted experiments is shown.
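A sketch of this deterministic iteration (our own; phi_i_predicted refers to the sketch given after Eq. (26), and the name iterate_dynamics is hypothetical) is:

```python
import numpy as np

def iterate_dynamics(y0, sigma_star, mu, lam, generations, use_oscillation=True):
    """Deterministic mean value dynamics, Eqs. (27) and (28)."""
    y = np.asarray(y0, dtype=float)
    N = y.size
    R_history = [np.linalg.norm(y)]
    for g in range(generations):
        sigma = sigma_star * np.linalg.norm(y) / N                      # Eq. (28)
        y = np.array([y[i] - phi_i_predicted(y, sigma, i, mu, lam, use_oscillation)
                      for i in range(N)])                               # Eq. (27)
        R_history.append(np.linalg.norm(y))
    return y, R_history
```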

Fig. 3. Comparing the average of 100 optimization runs of Algorithm 1 (black, solid) with the iterated dynamics from Eq. (27) under constant \(\sigma ^*=30\) for \(A=1\) and \(N=100\). Large population sizes are chosen to ensure global convergence (left: \(\mu =150\); right: \(\mu =1500\); constant \(\mu /\lambda =0.5\)). Iteration using progress (26) is performed for both \(f'_i=k_i+d_i\) (red/orange dashed) and \(f'_i(d_i \!= \! 0)=k_i\) (blue dash-dotted) using Equations (27) and (28). The orange dashed iteration was initialized with \(R^{(0)}=0.1\) and translated to the corresponding position of the simulation for easier comparison. The evaluation of quality variance \(D_i^2(R)\) is shown in the supplementary material (https://github.com/omam-evo/paper/blob/main/ppsn22/PPSN22_OB22.pdf). (Color figure online)

In Fig. 3, all runs of Algorithm 1 exhibit global convergence, with the black line showing their average. The left and right plots differ by population size. The iteration \(\varphi _i(k_i)\) (blue dash-dotted curve) also converges globally, though very slowly, and is therefore not shown entirely. The convergence behavior of the iteration \(\varphi _i(f'_i)\) (red and orange dashed curves) strongly depends on the initialization and is discussed below.

Three phases can be observed for the simulation. It shows linear convergence at first, followed by a slow-down due to local attractors. After reaching the global attractor, the convergence speed increases again. The iteration \(\varphi _i(k_i)\) is able to model the first two phases to some degree. Within the global attractor the slope information \(d_i\) is missing, such that the progress is largely underestimated.

The iteration \(\varphi _i(f_i')\) converges at first, but reaches a stationary state at \(R^{st} \approx 20\) when the progress \(\varphi _i\) becomes dominated by the derivative term \(d_i\). Starting from \(R^{(0)}=10^2\), the stationary \(y_i^{st}\) are either fixed or alternate between coordinates, depending on \(\sigma \), \(D_i\), \(k_i\), and \(d_i\). This effect is due to the attraction of local minima and due to the deterministic iteration disregarding fluctuations; it also occurs for varying initial positions. Initialized at \(R^{(0)}=10^{-1}\), the orange iteration \(\varphi _i(f_i')\) converges globally.

It turns out that the splitting point of the two approximations in Fig. 3 occurs at a distance R to the global optimizer at which the ES approaches the attractor region of the “first” local minima. For the model parameters considered in the experiment this is at about \(R\approx 28.2\) – the distance of the farthest local minimizer to the global optimizer (obtained by numerical analysis).

The plots in Fig. 3 differ by population size. The convergence speeds, i.e. the slopes, show better agreement for large populations, which can be attributed to the fluctuations neglected in (27). Investigations on the unimodal functions Sphere [2] and Ellipsoid [4] have shown that progress is decreased by fluctuations via a loss term scaling with \(1/\mu \), which agrees with Fig. 3. On the left the iterated progress is faster due to neglected but present fluctuations, while on the right better agreement is observed because fluctuations are insignificant. These observations will be investigated in future research.

6 Summary and Outlook

A first order progress rate \(\varphi _i\) was derived in (26) for the \((\mu /\mu _I, \lambda )\text {-ES}\) on the Rastrigin function (1) by means of noisy order statistics. To this end, the mutation-induced variance \(D_i^2\) of the quality change is needed; starting from (4), a derivation yielding \(D_i^2\) has been presented in the supplementary material. Furthermore, the approximation quality of \(\varphi _i\) was investigated using the Rastrigin and quadratic derivatives \(f_i'\) and \(k_i\), respectively, by comparison with one-generation experiments.

Linearization \(f_i'\) shows good agreement for small mutations, but very large deviations for large mutations. Conversely, linearization \(k_i\) yields significantly better results for large mutations, as the quadratic fitness term dominates there. A progress rate modeling the transition between the two regimes is yet to be determined. First numerical investigations of (14) including all terms of (4) indicate that the nonlinear terms are needed for a better progress rate model, which is an open challenge and part of future research.

The obtained progress rate was used to investigate the dynamics by iterating (27) with (28) and comparing the result with real ES runs. The iteration via \(f_i'\) only converges globally if initialized close to the optimizer, since local attraction strongly dominates otherwise. The dynamics via \(k_i\) converges globally independent of the initialization, but the observed rate matches only in the initial phase and for very large populations. This confirms the need for a higher order progress rate modeling the effect of fluctuations, especially when function evaluations are expensive and small populations must be used. Additionally, an advanced progress rate formula combining the effects of global and local attraction is needed to model all three phases of the dynamics correctly.

The investigations done so far are a first step towards a full dynamical analysis of the ES on the multimodal Rastrigin function. Future investigations must also include the complete dynamical modeling of the mutation strength control. One aim is the tuning of mutation control parameters such that the global convergence probability is increased while still maintaining search efficiency. Our final goal will be the theoretical analysis of the full evolutionary process yielding also recommendations regarding the choice of the minimal population size needed to converge to the global optimizer with high probability.